Re: [PATCH] target: Update copyright ownership to 2012
On Fri, Nov 9, 2012 at 3:00 PM, Nicholas A. Bellinger wrote:
> This patch to update copyright year to current for principal target core
> ownership is now being pushed into target-pending/for-next.

Pardon me, but you were just publicly accused of violating the GPL, so
your response is to send a patch removing the copyright notices of all
other organizations from the SCSI-target code?

Have you obtained ownership of all the relevant copyrights for
Linux-iSCSI.org, PyX Technologies, Inc, and SBE, Inc?

If not, then this patch is an attempted violation of those
organizations' copyrights and of the GPL (which requires that you
preserve copyright notices).

Further, while these notices are the only ones listed in those files,
they are not the only individuals outside of RisingTide Systems which
have significant copyright interest in this code. If your goal is to
obtain exclusive copyright ownership over this code then there are a
great many other people you must contact and convince first.

I would encourage you to talk privately with the Software Freedom
Conservancy before sending more patches of this nature.

Cheers,
Kyle Moffett

> diff --git a/drivers/target/target_core_alua.c b/drivers/target/target_core_alua.c
> - * Copyright (c) 2009-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_configfs.c b/drivers/target/target_core_configfs.c
> - * Copyright (c) 2008-2011 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005-2006 SBE, Inc. All Rights Reserved.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_fabric_configfs.c b/drivers/target/target_core_fabric_configfs.c
> - * Copyright (c) 2010,2011 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_fabric_lib.c b/drivers/target/target_core_fabric_lib.c
> - * Copyright (c) 2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
> - * Copyright (c) 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005-2006 SBE, Inc. All Rights Reserved.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_hba.c b/drivers/target/target_core_hba.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_pr.c b/drivers/target/target_core_pr.c
> - * Copyright (c) 2009, 2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_pscsi.c b/drivers/target/target_core_pscsi.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_rd.c b/drivers/target/target_core_rd.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
> - * Copyright (c) 2002, 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_spc.c b/drivers/target/target_core_spc.c
> - * Copyright (c) 2002, 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_stat.c b/drivers/target/target_core_stat.c
> - * Copyright (c) 2011 Linux-iSCSI.org
> - * Copyright (c) 2006-2007 SBE, Inc. All Rights Reserved.
> diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c
> - * Copyright (c) 2009,2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_tpg.c b/drivers/target/target_core_tpg.c
> - * Copyright (c) 2002, 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> - * Copyright (c) 2002, 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_ua.c b/drivers/target/target_core_ua.c
> - * Copyright (c) 2009,2010 Linux-iSCSI.org
[RFC] Best method to control a "transmit-only" mode on fiber NICs (specifically sky2)
Hi,

The company I'm working for has an unusual fiber NIC configuration
that we use for one of our network appliances. We connect only a
single fiber from the TX port on one NIC to the RX port on another
NIC, providing a physically-one-way path for enhanced security.
Unfortunately this doesn't work with most NIC drivers, as even with
auto-negotiation off they look for link probe pulses before they
consider the link "up" and are willing to send packets.

We have been able to use Myricom 10GigE NICs with a custom firmware
image. More recently we have patched the sky2 driver to turn on the
FIB_FORCE_LNK flag in the PHY control register; this seems to work on
the Marvell-chipset boards we have here.

What would be the preferred way to control this "force link" flag?
Right now we are accessing it using ethtool; we have added an
additional "duplex" mode: "DUPLEX_TXONLY", with a value of 2. When
you specify a speed and turn off autonegotiation
("./patched-ethtool -s eth2 speed 1000 autoneg off duplex txonly"),
it will turn on the specified bit in the PHY control register and the
link will automatically come up.

We also have one related bug-fix^Wdirty hack for sky2 to reset the PHY
a second time during netif-up after enabling interrupts; otherwise the
immediate "link up" interrupt gets lost.

Once I get approval from the company I will post the patch itself for
review.

I look forward to your comments and suggestions.

Cheers,
Kyle Moffett

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[NET/IPv6] Race condition with flow_cache_genid?
Whoops, I accidentally sent this to [EMAIL PROTECTED] instead of
[EMAIL PROTECTED] Original email below:

Hi,

I was poking around trying to figure out how to install the Mobile
IPv6 daemons this evening and noticed they required a kernel patch,
although upon further inspection the kernel patch seemed to already be
applied in 2.6.24. Unfortunately the flow cache appears to be
horribly racy. Attached below are the only uses of the variable
"flow_cache_genid" in 2.6.24.

Now, I am no expert in this particular area of the code, but the
"atomic_t flow_cache_genid" variable is ONLY ever used with
atomic_inc() and atomic_read(). There are no memory barriers or other
dec_and_test()-style functions, so that variable could just as easily
be replaced with a plain old C int.

Basically either there is some missing locking here or it does not
need to be "atomic_t". Judging from the way it *appears* to be used
to check if cache entries are up-to-date with the latest changes in
policy, I would guess the former.

In particular that whole "flow_cache_lookup()" thing looks racy as
hell, since somebody could be in the middle of that looking at
"if (fle->genid == atomic_read(&flow_cache_genid))". It does the
atomic_read(), which BTW is literally implemented as:

    #define atomic_read(atomicvar) ((atomicvar)->value)

on some platforms. Immediately after the atomic read (or even before,
since there's no cache-flush or read-modify-write), somebody calls
into "selinux_xfrm_notify_policyload()" and increments the
flow_cache_genid because selinux just loaded a security policy. Now
we're accepting a cache entry which applies to PREVIOUS security
policy. I can only assume that's really bad.

Even worse, there seems to be a race between SELinux loading a new
policy and calling selinux_xfrm_notify_policyload(), since we could
easily get packets and process them according to the old cache entry
on one CPU before SELinux has had a chance to update the generation ID
from the other. Furthermore, there's no guarantee the CPU caches will
get updated in reasonable time.

Clearly SELinux needs to have some way of atomically invalidating the
flow cache of all CPUs *simultaneously* with loading a new policy,
which probably means they both need to be under the same lock, or
something. The same problem appears to occur with updating the XFRM
policy and incrementing flow_cache_genid.

Probably the fastest solution is to put the flow cache under the
xfrm_policy_lock (which already disables local bottom-halves), and
either take that lock during SELinux policy load or if there are lock
ordering problems then add a variable "flow_cache_ignore" and change
the xfrm_notify hooks:

    void selinux_xfrm_notify_policyload_pre(void)
    {
        write_lock_bh(&xfrm_policy_lock);
        flow_cache_genid++;
        flow_cache_ignore = 1;
        write_unlock_bh(&xfrm_policy_lock);
    }

    void selinux_xfrm_notify_policyload_post(void)
    {
        write_lock_bh(&xfrm_policy_lock);
        flow_cache_ignore = 0;
        write_unlock_bh(&xfrm_policy_lock);
    }

Cheers,
Kyle Moffett

BEGIN QUOTED CODE INVOLVING flow_cache_genid:

include/net/flow.h:94:
    extern atomic_t flow_cache_genid;

net/core/flow.c:39:
    atomic_t flow_cache_genid = ATOMIC_INIT(0);

net/core/flow.c:169: flow_cache_lookup():
    if (flow_hash_rnd_recalc(cpu))
        flow_new_hash_rnd(cpu);
    hash = flow_hash_code(key, cpu);

    head = &flow_table(cpu)[hash];
    for (fle = *head; fle; fle = fle->next) {
        if (fle->family == family &&
            fle->dir == dir &&
            flow_key_compare(key, &fle->key) == 0) {
            if (fle->genid == atomic_read(&flow_cache_genid)) {
                void *ret = fle->object;

                if (ret)
                    atomic_inc(fle->object_ref);
                local_bh_enable();

                return ret;
            }
            break;
        }
    }

net/xfrm/xfrm_policy.c:1025:
    int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
    {
        write_lock_bh(&xfrm_policy_lock);
        pol = __xfrm_policy_unlink(pol, dir);
        write_unlock_bh(&xfrm_policy_lock);
        if (pol) {
            if (dir < XFRM_POLICY_MAX)
                atomic_inc(&flow_cache_genid);
            xfrm_policy_kill(pol);
            return 0;
        }
        return -ENOENT;
    }

net/ipv6/inet6_connection_sock.c:142:
    static inline void __inet6_csk_dst_store(struct sock *sk,
                                             struct dst_entry *dst,
                                             struct in6_addr *daddr,
                                             struct in6_addr *saddr)
    {
        __ip6_dst_store(sk, dst, daddr, saddr);

    #ifdef CONFIG_XFRM
        {
            struct rt6_info *rt = (struct rt6_info *)dst;
            rt->rt6i_flow_cache_genid = atomic_read(&flow_cache_genid);
        }
    #endif
    }

security/selinux/include/xfrm.h:41:
    static inline void selinux_xfrm_notify_policyload(void)
    {
        atomic_inc(&flow_cache_genid);
    }
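The "same lock" idea in this mail can be illustrated outside the kernel. The following is a userspace sketch only, not the kernel code and not a proposed patch: it uses a pthread mutex where the mail suggests xfrm_policy_lock, and every name in it (policy_lock, policy_load(), cache_lookup(), struct cache_entry) is invented for the example. It also assumes each cache entry belongs to a single thread, much as the kernel flow cache is per-CPU.

```c
#include <pthread.h>

/* One cached decision, tagged with the generation it was computed under. */
struct cache_entry {
	unsigned int genid;
	int valid;
	int verdict;	/* stand-in for the cached policy object */
};

static pthread_mutex_t policy_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int cache_genid;	/* plain int: only touched under policy_lock */
static int current_policy;		/* stand-in for the real policy state */

/* Writer: replacing the policy and bumping the generation is one step. */
void policy_load(int new_policy)
{
	pthread_mutex_lock(&policy_lock);
	current_policy = new_policy;
	cache_genid++;
	pthread_mutex_unlock(&policy_lock);
}

/*
 * Reader: the generation check and the use of the cached value happen
 * under the same lock as policy_load(), so a policy change cannot land
 * between the check and the use.
 */
int cache_lookup(struct cache_entry *e)
{
	int verdict;

	pthread_mutex_lock(&policy_lock);
	if (!e->valid || e->genid != cache_genid) {
		/* Empty or stale entry: recompute from the current policy. */
		e->verdict = current_policy;
		e->genid = cache_genid;
		e->valid = 1;
	}
	verdict = e->verdict;
	pthread_mutex_unlock(&policy_lock);
	return verdict;
}
```

Because the counter is only read and written with the lock held, it no longer needs to be an atomic type, which is the point the mail makes about atomic_t. The cost is that every lookup now takes the lock, so a real implementation would have to weigh that against the lockless fast path.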
Re: [PATCH] Allow NBD to be used locally
Whoops, only hit "Reply" on the first email, sorry Jan.

On Feb 2, 2008 7:54 PM, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> On Feb 2 2008 18:31, [EMAIL PROTECTED] wrote:
> >
> >> How will that work? Fuse makes up a filesystem - not helpful
> >> if you have a raw disk without a known fs to mount.
> >
> >take zfs-fuse or ntfs-3g for example.
> >you have a blockdevice or backing-file containing data structures and fuse
> >makes those show up as a filesystem.
> >i think vmware-mount is not different here.
>
> vmware-mount IS different, it provides the _block_ device,
> which is then mounted through the usual mount(2) mechanism
> (if there is a filesystem driver for it).

As far as I can tell, vmware-mount should be re-implemented as a
little perl script around "dmsetup" and/or "losetup", possibly with
"dm-userspace" patched into the kernel to allow you to handle
non-mapped blocks in your userspace daemon when somebody tries to
access them. If you don't need that ability then straight dm-loop and
dm-linear will work.

Cheers,
Kyle Moffett
Re: [PATCH 5/9] bfs: move function prototype to the proper header file
On Jan 24, 2008, at 18:13, Dmitri Vorobiev wrote:
> Heikki Orsila wrote:
>> On Fri, Jan 25, 2008 at 01:32:04AM +0300, Dmitri Vorobiev wrote:
>>> +/* inode.c */
>>> +extern void dump_imap(const char *, struct super_block *);
>>> +
>>
>> Functions should not be externed, remove extern keyword.
>
> Care to explain why? Following is an explanation why the contrary is
> probably true:
>
> 1) We have lots of precedents in existing code:
>
> [EMAIL PROTECTED]:~/Projects/misc/linux$ git-grep 'extern void' include | wc -l
> 5523
> [EMAIL PROTECTED]:~/Projects/misc/linux$

The "extern" keyword on functions is *completely* redundant.

For C variables:
  Declaration:  extern int foo;
  Definition:   int foo;
  File-scoped:  static int foo;

For C functions:
  Declaration:  void foo(int x);
  Definition:   void foo(int x) { /*...body...*/ }
  File-scoped:  static void foo(int x) { /*...body...*/ }

The compiler will *allow* you to use "extern" on the function
prototype, but the presence or absence of a function body is
sufficiently obvious for it to determine whether the prototype is a
declaration or a definition that the "extern" keyword is not required
and therefore redundant.

For maximum readability and cleanliness I recommend that you leave off
the "extern" on the function declarations; it makes the lines much
longer without obvious gain.

Cheers,
Kyle Moffett
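The table above can be checked with a small translation unit. This is an illustrative sketch; the names next_id, id_counter and bump are made up for the example and do not come from the bfs patch.

```c
/* Function declarations: with or without "extern" these are identical. */
int next_id(void);
extern int next_id(void);

/* Object declaration: here "extern" matters, no storage is allocated. */
extern int id_counter;

/* Object definition: storage is allocated and initialised. */
int id_counter = 41;

/* File-scoped helper: "static" gives it internal linkage. */
static int bump(int v)
{
	return v + 1;
}

/* Function definition: the body is what makes it a definition. */
int next_id(void)
{
	id_counter = bump(id_counter);
	return id_counter;
}
```

A C compiler accepts both next_id() prototypes side by side because they declare the same function with the same linkage. Dropping "extern" from the id_counter declaration would instead turn it into a tentative definition, which is why the keyword is meaningful for variables only.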
Re: [PATCH 00/26] Permit filesystem local caching
On Jan 15, 2008, at 18:46, David Howells wrote:
>  (*) 01-keys-inc-payload.diff
>  (*) 02-keys-search-keyring.diff
>  (*) 03-keys-callout-blob.diff

One vaguely related question: Is there presently any way to adjust
the per-user max-key-data limit? I've been tinkering with using the
new-ish MIT kerberos "KEYRING:" credentials-cache code to hold keys
for persistent daemons. Unfortunately "root" keeps hitting the limit
even with only about 16 keys allocated across a few sessions. After
perusing the docs I can't find any documentation on adjusting the
limits.

I'd really like some way to specifically allow root to allocate up to
several megs worth of non-swappable key data, although I suppose just
increasing the global limit slightly wouldn't be bad either. If such
functionality already exists then I'd appreciate a pointer to it (and
possibly respond in kind with documentation patches).

Cheers,
Kyle Moffett
Re: The ext3 way of journalling
On Jan 08, 2008, at 15:51:53, Andi Kleen wrote:
> Theodore Tso <[EMAIL PROTECTED]> writes:
>> Now, there are good reasons for doing periodic checks every N mounts and after M months. And it has to do with PC class hardware. (Ted's aphorism: "PC class hardware is cr*p").
>
> If these reasons are good ones (some skepticism here) then the correct way to really handle this would be to do regular background scrubbing during runtime; ideally with metadata checksums so that you can actually detect all corruption.

Poor man's background scrubbing:

(A) Use LVM like virtually all modern distros offer
(B) Leave some extra space in your LVM volume group (enough for 1 snapshot over the time it takes to do an FSCK).
(C) Periodically run the following scriptlet:

    set -e
    START="$(date +'%Y%m%d%H%M%S')"
    # SNAPSIZE is an assumed variable: space reserved for the snapshot (see B)
    lvcreate -s -L "${SNAPSIZE}" -n "${VOLUME}-snap" "${VG}/${VOLUME}"
    if nice -n 19 fsck -fy "/dev/${VG}/${VOLUME}-snap"; then
        echo 'Background scrubbing succeeded!'
        tune2fs -T "${START}" "/dev/${VG}/${VOLUME}"
    else
        echo 'Background scrubbing failed!  Reboot to fsck soon!'
        tune2fs -C 16383 -T "19000101" "/dev/${VG}/${VOLUME}"
    fi
    lvremove -f "${VG}/${VOLUME}-snap"

Basically you can fsck the offline snapshot in the background. If it succeeds you can adjust the "last checked" date to the time when the snapshot was taken, and if it fails you can schedule an FSCK at next reboot (and possibly remount the filesystem read-only or reboot immediately).

You can do the same thing for your /boot volume, although you probably have to manually use dmsetup since most bootloaders can't interpret LVM volumes.

I've always been surprised that distros like RedHat which automatically use LVM don't stuff this in their weekly or monthly checks on desktop systems. User experience could also be dramatically improved with automated smartd configuration and user-interactive logging and warning messages.

> But since fsck is so slow and disks are so big this whole thing is a ticking time bomb now. e.g. it is not uncommon to require tens of minutes or even hours of fsck time and some server that reboots only every few months will eat that when it happens to reboot. This means you get a quite long downtime.

My servers all have an "interval-between-checks" of 2-6 weeks and are configured to run lowest-priority ("nice -n 19") background "fsck" checks during off-hours, between once every few days and once every few weeks. I also have the "max mount count" numbers set to primes between 7 and 37 (depending on the filesystem) so that troubled or frequently-rebooted systems are more frequently verified.

The end result is that I almost never have the dreaded 4-hour-fsck-on-boot problem. A drive has certainly been fscked within the last few weeks of operation, and multiple large filesystems will only very rarely all be fscked at the same boot (once every least common multiple of their max-mount-counts).

Cheers,
Kyle Moffett
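The effect of choosing distinct primes for the max-mount-counts can be checked with a little arithmetic. The sketch below rests on assumptions: every filesystem is mounted exactly once per boot, only the mount-count trigger is considered (a time-based or background check resets the counter and shifts the pattern), and the function names are invented for illustration. Under those assumptions, two filesystems come due together once every lcm(a, b) boots, which for distinct primes is simply a*b.

```c
/* Greatest common divisor by Euclid's algorithm. */
unsigned long gcd_ul(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple; divides before multiplying to limit overflow. */
unsigned long lcm_ul(unsigned long a, unsigned long b)
{
    if (a == 0 || b == 0)
        return 0;
    return a / gcd_ul(a, b) * b;
}

/*
 * Number of boots between the occasions on which *every* filesystem in the
 * set hits its max-mount-count at the same boot (hypothetical helper).
 */
unsigned long mounts_between_joint_fsck(const unsigned long *max_mount_counts,
                                        unsigned long n)
{
    unsigned long result = 1;

    for (unsigned long i = 0; i < n; i++)
        result = lcm_ul(result, max_mount_counts[i]);
    return result;
}
```

With counts of 7, 11 and 37, all three checks coincide only once every 2849 boots, whereas "round" counts such as 20 and 30 would line up every 60 boots.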
Re: freeze vs freezer
On Jan 04, 2008, at 15:54:06, Oliver Neukum wrote:
> On Thursday, 3 January 2008 23:06:07, Nigel Cunningham wrote:
>> Hi.
>>
>>> a) mount fuse on /tmp/first
>>> b) mount fuse on /tmp/second
>>>
>>> Then the server task for (a) does "ls /tmp/second". So it will be frozen, right? How do you then freeze (a)? And keep in mind that the server task may have forked.
>>
>> I guess I should first ask, is this a real life problem or a hypothetical twisted web? I don't see why you would want to make two filesystems interdependent - it sounds like the way to create livelock and deadlocks in normal use, before we even begin to think about hibernating.
>
> Good questions. I personally don't use fuse, but I do care about power management. The problem I see is that an unprivileged user could make that dependency, even inadvertently.

I don't think it makes sense for the kernel to try to keep track of hard data dependencies for FUSE filesystems, or to even *attempt* to auto-suspend them. You should instead allow a privileged program to initiate a "freeze-and-flush" operation on a particular FUSE filesystem and optionally wait for it to finish. Then your userspace would be configured with the appropriate data dependencies and would stop FUSE filesystems in the appropriate order.

In addition, the kernel would automatically understand ext3=>loopback=>fuse, and when asked to freeze the "fuse" part, it would first freeze the "ext3" and the "loopback" parts using mechanisms similar to those device-mapper currently uses when you do "dmsetup suspend mydev" followed by "echo 0 $SIZE snapshot /dev/mapper/mydev-base /dev/mapper/mydev-snap-back p 8 | dmsetup load mydev" (i.e., when you create a snapshot of a given device).

Naturally userspace could deadlock itself (although not the kernel) by freezing a block device and then attempting to access it, but since the "freeze" operation is limited to root this is not a big issue. The way to freeze all filesystems safely would be to clone a new mount namespace, mlockall(), mount a tmpfs, pivot_root() into the tmpfs, bind-mount the filesystems you want to freeze directly onto subdirectories of the tmpfs, and then freeze them in an appropriate order.

Besides which the worst-case is a pretty straightforward non-critical failure; you might fail to fully sync a FUSE filesystem because its daemon is asleep waiting on something (possibly even just sitting in a "sleep(1)" call with all signals masked). You simply need to make sure that all tasks are asleep outside of driver critical sections so that you can properly suspend your device tree.

Cheers,
Kyle Moffett
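Freezing "in an appropriate order" amounts to a topological sort over the stacking relationships. What follows is a rough userspace sketch of that idea only; the matrix representation, MAX_FS and freeze_order() are all assumptions made for illustration and do not correspond to any existing kernel or FUSE interface. A filesystem is scheduled for freezing once everything stacked on top of it has been scheduled, and a dependency cycle (such as the two mutually dependent FUSE mounts discussed in the thread) is reported as an error instead of being "solved".

```c
#include <stddef.h>

#define MAX_FS 16   /* arbitrary limit for this sketch */

/*
 * stacked_on[i][j] != 0 means filesystem i does its I/O through filesystem j
 * (for ext3 => loopback => fuse: ext3 is stacked on loopback, loopback on fuse).
 * On success fills order[0..n-1] with indices in the order they should be
 * frozen (users first, backing stores last) and returns 0.
 * Returns -1 if n is too large or the dependencies contain a cycle.
 */
int freeze_order(size_t n, unsigned char stacked_on[][MAX_FS], size_t *order)
{
    size_t pending[MAX_FS];            /* not-yet-frozen users of each fs */
    unsigned char done[MAX_FS] = {0};
    size_t count = 0;

    if (n > MAX_FS)
        return -1;

    for (size_t j = 0; j < n; j++) {
        pending[j] = 0;
        for (size_t i = 0; i < n; i++)
            if (stacked_on[i][j])
                pending[j]++;
    }

    while (count < n) {
        size_t pick = n;

        /* Find a filesystem that nothing unfrozen still depends on. */
        for (size_t j = 0; j < n; j++) {
            if (!done[j] && pending[j] == 0) {
                pick = j;
                break;
            }
        }
        if (pick == n)
            return -1;                 /* cycle: no safe order exists */

        done[pick] = 1;
        order[count++] = pick;

        /* Its backing stores now have one fewer active user. */
        for (size_t j = 0; j < n; j++)
            if (stacked_on[pick][j])
                pending[j]--;
    }
    return 0;
}
```

For the ext3 => loopback => fuse stack this yields ext3, then loopback, then fuse; for two filesystems that each depend on the other it returns -1, which is the case where userspace has to refuse or fall back to the non-critical failure described above.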
Re: Get physical MAC address
On Jan 01, 2008, at 21:42:18, Jon Masters wrote:
> On Mon, 2007-12-31 at 12:39 +0700, Theewara Vorakosit wrote:
>> I get MAC address from ioctl. However, ifconfig can change this MAC address. Can I get a real physical MAC address of the NIC?
>
> Forgive me reading into your mail...this smells a bit like some kind of licensing/compliance thing. Just bear in mind that using the MAC to verify the identity of a machine is utterly useless and pointless - anyone can trivially fool your software[0] to see what it "wants".

Not necessarily; I can easily see distros wanting to have a "Restore defaults" button in their network config windows which also includes restoring the default MAC address to the NIC.

It should also be pointed out that anybody with one of a selection of re-flashable NICs (or NICs with removable EEPROMs) can easily change the MAC address on their NIC. Other alternatives include renaming eth0 to mynet0 and creating a downed dummy interface called "eth0" with the desired MAC addr.

> [0] We used to have to do far worse kludgery in college, in order to prevent the silly powers that be who "banned" network cards other than those made by one manufacturer from being used on their little network.

Well for basically any userspace-level check, all it takes is somebody who knows ASM and has about 5 minutes to track down the problematic branch instructions. Then they just have to write a 10-line GDB script which starts the program, traps the appropriate instructions, and then changes a "0" to a "1" (or vice versa) before the conditional branch. On Windows it's vaguely practical (albeit crash-prone) to load a kernel hack which prevents your program from being debugged, but under Linux it's effectively impossible.

Cheers,
Kyle Moffett
Re: yield API
On Dec 12, 2007, at 17:39:15, Jesper Juhl wrote:
> On 02/10/2007, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>> sched_yield() has been around for a decade (about three times longer than futexes were around), so if it's useful, it sure should have grown some 'crown jewel' app that uses it and shows off its advantages, compared to other locking approaches, right?
>
> I have one example of sched_yield() use in a real app. Unfortunately it's proprietary so I can't show you the source, but I can tell you how it's used.
>
> The case is this: Process A forks process B. Process B does some work that takes approximately between 50 and 1000ms to complete (varies), then it creates a file and continues to do other work. Process A needs to wait for the file B creates before it can continue. Process A *could* immediately go into some kind of "check for file; sleep n ms" loop, but instead it starts off by calling sched_yield() to give process B a chance to run and hopefully get to the point where it has created the file before process A is again scheduled and starts to look for it - after the single sched yield call, process A does indeed go into a "check for file; sleep 250ms;" loop, but most of the time the initial sched_yield() call actually results in the file being present without having to loop like that.

That is a *terrible* disgusting way to use yield. Better options:
  (1) inotify/dnotify
  (2) create a "foo.lock" file and put the mutex in that
  (3) just start with the check-file-and-sleep loop.

> Now is this the best way to handle this situation? No.
> Does it work better than just doing the wait loop from the start? Yes.

It works better than doing the wait-loop from the start? What evidence do you provide to support this assertion? Specifically, in the first case you tell the kernel "I'm waiting for something but I don't know what it is or how long it will take"; while in the second case you tell the kernel "I'm waiting for something that will take exactly X milliseconds, even though I don't know what it is". If you really want something similar to the old behavior then just replace the "sched_yield()" call with a proper sleep for the estimated time it will take the program to create the file.

> Is this a good way to use sched_yield()? Maybe, maybe not. But it *is* an actual use of the API in a real app.

We weren't looking for "actual uses", especially not in binary-only apps. What we are looking for is optimal uses of sched_yield(); ones where that is the best alternative. This... certainly isn't.

Cheers,
Kyle Moffett
Re: New Address Family: Inter Process Networking (IPN)
On Dec 06, 2007, at 00:30:16, Renzo Davoli wrote:
> AF_IPN is different. AF_IPN is the broadcast and peer-to-peer extension of AF_UNIX. It supports communication among *user* processes.

Ok, you say it's different, but then you describe how IP unicast and broadcast work. Both are frequently used for communication among "*user* processes". Please provide significantly more details about exactly *how* it's different.

> Example: Qemu, User-Mode Linux, Kvm, our umview machines can use IPN as an Ethernet Hub and communicate among themselves with the hosting computer and the world by a tap like interface.

You say "tap like" interface, but people do this already with existing infrastructure. You can connect Qemu, UML, and KVM to a standard Linux "tap" interface, and then use the standard Linux bridging code to connect the "tap" interface to your existing network interfaces. Alternatively you could use the standard and well-tested IP routing/firewalling/NAT code to move your packets around. None of this requires new network infrastructure in the slightest. If you have problems with the existing code, please improve it instead of creating a slightly incompatible replacement which has different bugs and workarounds.

> You can also grab an interface (say eth1) and use eth0 for your hosting computer and eth1 for the IPN network of virtual machines.

You can do that already with the bridging code.

> If you load the kvde_switch submodule IPN can be a virtual Ethernet switch.

As I described above, this can be done with the existing bridging and tun/tap code.

> Another Example: You have a continuous stream of data packets generated by a process, and you want to send this data to many processes. Maybe the set of processes is not known in advance, you want to send the data to any interested process. Some kind of publish/subscribe communication service (among unix processes not on TCP-IP). Without IPN you need a server. With IPN the sender creates the socket, connects to it and feeds it with data packets. All the interested receivers connect to it and start reading. That's all.

This is already done frequently in userspace. Just register a port number with IANA on which to implement a "registration" server and write a little daemon to listen on 127.0.0.1:${YOUR_PORT}. Your interconnecting programs then use either unicast or multicast sockets to bind, then report to the registration server what service they are offering and what port it's on. Your "receivers" then connect to the registration server, ask what port a given service is on, and then multicast-listen or unicast-connect to access that service.

The best part is that all of the performance implications are already thoroughly understood. Furthermore, if you want to extend your communication protocol to other hosts as well, you just have to replace the 127.0.0.1 bind with a global bind. This is exactly how the standard-specified multiple-participant "SIP" protocol works, for example.

So if you really think this is something that belongs in the kernel you need to provide much more detailed descriptions and use-cases for why it cannot be implemented in user-space or with small modifications to existing UDP/TCP networking.

Cheers,
Kyle Moffett
Re: [PATCH] Reduce stack used by lib/hexdump.c
On Dec 05, 2007, at 21:42:35, Joe Perches wrote:
> On Wed, 2007-12-05 at 18:18 -0800, Randy Dunlap wrote:
>> Joe Perches wrote:
>>> Maybe just eliminate the 16 or 32 byte width option and force it to only 16 byte widths.
>>
>> Have you checked users (callers)? I'm pretty sure that one of the callers wanted 32 and that's why it's there.
>
> I did. There is only 1 subsystem. That's easy to change.
>
> drivers/mtd/ubi/debug.c: print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_OFFSET, 32, 1,
> drivers/mtd/ubi/io.c: print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_OFFSET, 32, 1,
>
> Long lines in the log file are not too easy to read anyway. Using 16 byte dumps per line instead of 32 isn't painful. It gets rid of the allocation, reduces the argument count and makes the kernel smaller. I think it's all good. Every current caller would have to change though.

Alternatively, since print_hex_dump is not a performance-critical path (and usually indicates an error/debug condition), you could probably just make a static "hexdump_lock" spinlock and spin_lock_irqsave()/spin_unlock_irqrestore(). It would always nest inside any other lock (except during crash, where we break locks already for printk()), and I doubt any of the callers would notice the serialization since they're already serialized on the printk buffer.

Cheers,
Kyle Moffett
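The locking alternative cannot be shown as runnable kernel code in isolation, so the following is only a userspace analogy of the idea, with a pthread mutex standing in for the spinlock and fprintf() standing in for printk(). All names (row_buf, row_lock, hex_dump) are made up for this sketch. What it illustrates is the trade: the per-call line buffer moves off the stack into one static buffer, and callers are serialized while a row is being formatted and emitted.

```c
#include <pthread.h>
#include <stdio.h>

#define ROW_BYTES 32   /* widest row the callers in the thread ask for */

/* One shared line buffer instead of a per-call stack buffer:
 * two hex digits per byte, a space between bytes, and a NUL. */
static char row_buf[ROW_BYTES * 3 + 1];
static pthread_mutex_t row_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Dumps `len` bytes as hex, ROW_BYTES per line, to `out`.
 * Returns the number of lines written.
 */
int hex_dump(FILE *out, const void *data, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *p = data;
    int rows = 0;

    /* Serialize all users of the shared buffer. */
    pthread_mutex_lock(&row_lock);
    while (len > 0) {
        size_t chunk = len < ROW_BYTES ? len : ROW_BYTES;
        size_t pos = 0;

        for (size_t i = 0; i < chunk; i++) {
            if (i)
                row_buf[pos++] = ' ';
            row_buf[pos++] = digits[p[i] >> 4];
            row_buf[pos++] = digits[p[i] & 0x0f];
        }
        row_buf[pos] = '\0';
        fprintf(out, "%s\n", row_buf);

        p += chunk;
        len -= chunk;
        rows++;
    }
    pthread_mutex_unlock(&row_lock);
    return rows;
}
```

In the kernel version the lock would additionally have to be safe in interrupt context, which is why the message suggests the irqsave/irqrestore variants.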
Re: Relax permissions for reading hard drive serial number?
On Dec 02, 2007, at 13:45:44, Matti Aarnio wrote:
> This lack of having stable(*) unique system identifier available to applications is one of the small details that make node locked commercial software delivery challenging thing in UNIX environments.
>
> *) "stable" as both stable data, and stable API to get it.

Well... There's that. There's also the fact that anybody with a modicum of ASM programming skills who gets clever with GDB and compares traces from a "Correct HW serial" run and an "Incorrect HW serial" run can write a 10-line GDB script to make it work regardless.

I did something similar with a popular FPS (which I legitimately own) on one of my Mac systems after having left the DVD behind when going to a LAN party. Addresses removed to protect the innocent^Wguilty, but they took maybe 15 minutes to acquire:

    break *END_OF_CDKEY_CODE_DECRYPTION
    run
    delete 1
    advance *JUST_AFTER_CDKEY_CHECK
    set $r3 = 0
    detach

At some point every such "locked" computer program has code like this:

    if (program_is_not_authorized()) {
        display_nasty_dialog();
        exit(1);
    }

All it takes for somebody with a debugger is to identify the last instruction of the "program_is_not_authorized()" function and change $r3 (or whatever return register your system uses) from a 1 to a 0.

The fact remains that once the software is running on *THEIR* computer there is nothing you can practically do to forcibly prevent them from using it in whatever fashion they desire. Typically if you price your software reasonably people will be willing to pay for multiple copies, but there are no foolproof technical measures to enforce that they do so.

Cheers,
Kyle Moffett
Re: Kernel Development & Objective-C
On Nov 30, 2007, at 13:40:07, H. Peter Anvin wrote:
> Kyle Moffett wrote:
>> With that said, there is a significant performance penalty as all Objective-C method calls are looked up symbolically at runtime for every single call.
>
> GACK! At least C++ has vtables.

In a tight loop there is a way to do a single symbolic lookup and just call directly through a function pointer, but typically it isn't necessary for GUI programs and the like. The flexibility of being able to dynamically add new methods to an existing class (at least for desktop user interfaces) significantly outweighs the performance cost. Any performance-sensitive code is typically written in straight C anyways.

Cheers,
Kyle Moffett
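For readers who have not seen the single-lookup trick mentioned above, here is a minimal sketch of the idea in plain C. It is not the Objective-C runtime: the selector table, `lookup_method()`, `send_message()` and the two `sum_*` helpers are all hypothetical names invented for this illustration, and the linear string search only stands in for whatever lookup a real runtime would do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical method signature: receiver state plus one argument. */
typedef long (*method_fn)(long self, long arg);

struct method_entry {
    const char *selector;   /* symbolic name, resolved at runtime */
    method_fn   impl;
};

static long add_impl(long self, long arg) { return self + arg; }
static long mul_impl(long self, long arg) { return self * arg; }

static const struct method_entry methods[] = {
    { "add:", add_impl },
    { "mul:", mul_impl },
};

/* Counts symbolic lookups so the two strategies can be compared. */
static unsigned long lookup_count;

/* Symbolic lookup: search the table by selector name. */
static method_fn lookup_method(const char *selector)
{
    lookup_count++;
    for (size_t i = 0; i < sizeof(methods) / sizeof(methods[0]); i++)
        if (strcmp(methods[i].selector, selector) == 0)
            return methods[i].impl;
    return NULL;
}

/* Fully dynamic dispatch: one lookup per call. */
static long send_message(long self, const char *selector, long arg)
{
    method_fn fn = lookup_method(selector);
    assert(fn != NULL);
    return fn(self, arg);
}

/* Sums 0..n-1 with a symbolic lookup on every iteration. */
static long sum_dynamic(int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc = send_message(acc, "add:", i);
    return acc;
}

/* Same sum, but the lookup is hoisted out of the tight loop. */
static long sum_cached(int n)
{
    method_fn add = lookup_method("add:");
    long acc = 0;
    assert(add != NULL);
    for (int i = 0; i < n; i++)
        acc = add(acc, i);
    return acc;
}
```

Both helpers compute the same result; the cached variant performs one lookup in total where the dynamic variant performs one per iteration. The price is that a method added or replaced while the loop is running would not be picked up.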
Re: Kernel Development & Objective-C
On Nov 30, 2007, at 09:34:45, Lennart Sorensen wrote:
> On Thu, Nov 29, 2007 at 12:14:16PM +, Ben Crowhurst wrote:
>> Has Objective-C ever been considered for kernel development?
>
> Doesn't objective C essentially require a runtime to provide a lot of the features of the language? If it does (as I suspect) then it is totally unsuitable for kernel development. That and object oriented languages in general are badly designed and a bad idea. Having not used objective C I have no idea if it qualifies as badly designed or not. Certainly C++ and java are both very badly designed.

Objective-C is actually a pretty minimal wrapper around C; it was originally implemented as a C preprocessor. It generally does not have any kind of memory management, garbage collection, or anything else (although typically a "runtime" will provide those features). There are no first-class exceptions, so there would be nothing to worry about there (the exceptions used in GUI programs are built around the setjmp/longjmp primitives). Objective-C is also almost completely backwards-compatible with C, much more so than C++ ever was.

As far as the runtime goes the kernel would be expected to write its own, the same way that it implements "kmalloc()" as part of a "C runtime". Since the runtime itself never does any implicit memory allocation, I think it would conceivably even be relatively safe for kernel usage.

With that said, there is a significant performance penalty as all Objective-C method calls are looked up symbolically at runtime for every single call. For GUI programs where large chunks of the code are event-loops and not performance-sensitive that provides a huge amount of extra flexibility. In the kernel though, there are many codepaths where *every* *single* instruction counts; that could be a serious performance hit.

Cheers,
Kyle Moffett
Re: git guidance
On Nov 29, 2007, at 00:27:04, Al Boldi wrote:
> Jakub Narebski wrote:
>> Besides, you can always use "git show <revision>:<file>". For example gitweb (and I think other web interfaces) can show any version of a file or a directory, accessing only repository.
>
> Sure, browsing is the easy part, but Version Control starts when things become writable.

But... git history is inherently and completely immutable once created... that's the only way you can index everything with a simple SHA-1. If you want to write to the "git filesystem" by adding new commits then you need to use the appropriate commands, same as every other VCS on the planet.

Cheers,
Kyle Moffett
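The link between immutability and hash-based indexing can be shown with a toy model. The sketch below is not git's object format and does not use SHA-1: it assumes a 64-bit FNV-1a hash as a short stand-in, and `commit_id()` is a hypothetical helper that hashes a parent id together with the new content.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, 64-bit: a stand-in for SHA-1, used only to keep the sketch short. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;          /* FNV prime */
    }
    return h;
}

/* A toy commit id covers the parent's id and the commit's own content. */
static uint64_t commit_id(uint64_t parent_id, const char *content)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    assert(content != NULL);
    h = fnv1a(h, &parent_id, sizeof(parent_id));
    return fnv1a(h, content, strlen(content));
}
```

Because each id depends on the parent's id, editing an old commit yields a different id for it and for every descendant. The old objects are never changed in place; "writing" can only mean adding new objects, which is the point made above.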
Re: freeze vs freezer
On Nov 27, 2007, at 17:49:18, Jeremy Fitzhardinge wrote:
> Rafael J. Wysocki wrote:
>> Well, this is more-or-less how we all imagine that should be done eventually. The main problem is how to implement it without causing too much breakage. Also, there are some dirty details that need to be taken into consideration.
>
> For Xen suspend/resume, I'd like to use the freezer to get all threads into a known consistent state (where, specifically, they don't have any outstanding pagetable updates pending). In other words, the freezer as it currently stands is what I want, modulo some of these issues where it gets caught up unexpectedly. If threads end up getting frozen anywhere preempt isn't explicitly disabled, it wouldn't work for me.

The problem with "one freezer" is that "known consistent state" means something completely different to every single driver and subsystem. Xen wants it to mean "No pending page table updates and no more updates from this point forward". A network driver wants it to mean "All pending network packets DMAed out or in and the device shut down with all remaining packets queued". A SATA controller wants it to mean "All DMA quiesced and no more commands", etc.

The only way to have that work is to put minimal definitions of what state you care about in the drivers themselves. For Xen this means that you need to have an appropriately-timed suspend handler which hooks into Xen code very precisely to create and preserve the "No pending page table updates" state that you care about. It will be more work in the short term but it's the only maintainable solution in the long term IMO.

Cheers,
Kyle Moffett
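To make the "each driver defines its own consistent state" idea concrete, here is a small userspace sketch in C. It is not the kernel's driver model: `struct toy_device`, its callbacks and `suspend_all()` are hypothetical names, and the rollback-on-failure policy is an assumption added for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device model; not the kernel's struct device. */
struct toy_device {
    const char *name;
    int pending;      /* work the device still has in flight */
    int suspended;    /* 1 once the device reached its own quiescent state */
    int  (*suspend)(struct toy_device *dev);
    void (*resume)(struct toy_device *dev);
};

/* "Consistent" for this device: all in-flight work drained, then stopped. */
static int drain_and_stop(struct toy_device *dev)
{
    dev->pending = 0;
    dev->suspended = 1;
    return 0;
}

/* A device that cannot reach its quiescent state right now. */
static int refuse_suspend(struct toy_device *dev)
{
    (void)dev;
    return -1;
}

static void restart(struct toy_device *dev)
{
    dev->suspended = 0;
}

/*
 * Ask every device to reach its own definition of "consistent".
 * If one refuses, resume the devices already suspended, in reverse order.
 */
static int suspend_all(struct toy_device *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(devs[i].suspend != NULL && devs[i].resume != NULL);
        int err = devs[i].suspend(&devs[i]);
        if (err) {
            while (i-- > 0)
                devs[i].resume(&devs[i]);
            return err;
        }
    }
    return 0;
}
```

The core loop knows nothing about page tables, packets or DMA. Each callback carries that knowledge, which is the division of labour argued for above.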
Re: freeze vs freezer
On Nov 27, 2007, at 12:40:24, Rafael J. Wysocki wrote:
> On Tuesday, 27 of November 2007, Matthew Garrett wrote:
>> On Mon, Nov 26, 2007 at 10:53:34PM +0100, Rafael J. Wysocki wrote:
>>> On Monday, 26 of November 2007, David Chinner wrote:
>>>> So how do you handle threads that are blocked on I/O or a lock during the system freeze process, then?
>>>
>>> We wait until they can continue.
>>
>> So if I have a process blocked on an unavailable NFS mount, I can't suspend?
>
> That's correct, you can't.
>
> [And I know what you're going to say. ;-)]

Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE" instead of a zero preempt_count()? Really what we should do is just iterate over all of the actual physical devices and tell each one "Block new IO requests preemptibly, finish pending DMA, put the hardware in low-power mode, and prepare for suspend/hibernate". As long as each driver knows how to do those simple things we can have an entirely consistent kernel image for both suspend and for hibernation.

When all tasks are preemptible we can very trivially rely on the drivers to enforce the "Stop new IO submission" with a dirt-simple semaphore or waitqueue. The sleep itself will be TASK_UNINTERRUPTIBLE, but it will be done from a preemptible context. That way the system suspend time is the sum of the suspend times of the devices on the system, and the suspend time of any given device is the sum of its maximum non-preemptible critical section and the time to flush all of its remaining pending DMA/etc. This is almost completely independent of the load-level of the machine, and it does not depend on things like NFS filesystems.

The one gotcha is that it does not flush dirty filesystem pages to disk first, although that could be fixed with a few VFS and blockdev hooks which hierarchically flush and "freeze" block devices and filesystems before actually disabling devices, much the way that device-mapper can pause a device to take a snapshot and end up with a clean journal on the filesystem afterwards.

Cheers,
Kyle Moffett
Re: + smack-version-11c-simplified-mandatory-access-control-kernel.patch added to -mm tree
On Nov 24, 2007, at 22:36:43, Crispin Cowan wrote:
> Kyle Moffett wrote:
>> Actually, a fully-secured strict-mode SELinux system will have no unconfined_t processes; none of my test systems have any. Generally "unconfined_t" is used for situations similar to what AppArmor was designed for, where the only "interesting" security is that of the daemon (which is properly labelled) and one or more of the users are unconfined.
>
> Interesting. In a Targeted Policy, you do your policy administration from unconfined_t. But how do you administer a Strict Policy machine? I can think of 2 ways:
>
> [snip]
>
> * there is some type that is tighter than unconfined_t but none the less has sufficient privilege to change policy
>
> To me, this would be semantically equivalent to unconfined_t, because any rogue code or user with this type could then fabricate unconfined_t and do what they want

Well, in a strict SELinux system, someone who has been permitted the "Security Administrator" role (secadm_r) and who has logged in through a "login_t" process may modify and reload the policy. They are also permitted to view all files up to their clearance, write files below their level, and relabel files. On the other hand, they do not have any system-administration privileges (those are reserved for sysadm_r).

Under the default policy the security administrator may disable SELinux completely, although that too can be adjusted as "load policy" is yet another specialized permission.

Cheers,
Kyle Moffett
Re: + smack-version-11c-simplified-mandatory-access-control-kernel.patch added to -mm tree
On Nov 24, 2007, at 06:39:34, Crispin Cowan wrote:
> Andrew Morgan wrote:
>> It feels to me as if a MAC "override capability" is, if true to its name, extra to the MAC model; any MAC model that needs an 'override' to function seems under-specified... SELinux clearly feels no need for one,
>
> That's not quite right. More specifically, it already has one in the form of unconfined_t. AppArmor has a similar escape hatch in the "Ux" permission. It's not that they don't need one, it is that they already have one. They get to have one because they allow you to actually write a policy that is more nuanced than "process label must dominate object label".

Actually, a fully-secured strict-mode SELinux system will have no unconfined_t processes; none of my test systems have any. Generally "unconfined_t" is used for situations similar to what AppArmor was designed for, where the only "interesting" security is that of the daemon (which is properly labelled) and one or more of the users are unconfined.

Even then "unconfined_t" is not an implicit part of the policy; it is explicitly given the ability to take any action on any object by rules in the policy, and it typically still falls under a few MLS labeling restrictions even in the targeted policy.

Cheers,
Kyle Moffett
Re: [RFC] Documentation about unaligned memory access
On Nov 22, 2007, at 20:29:11, Alan Cox wrote:
>> Most architectures are unable to perform unaligned memory accesses. Any unaligned access causes a processor exception.
>
> Not all. Some simply produce the wrong answer - that's oh so much more exciting.

As one example, the MicroBlaze soft-core processor family designed for use on Xilinx FPGAs will (by default) simply forcibly zero the lower bits of the unaligned address, such that the following code will fail mysteriously:

    const char foo[] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07 };
    printf("0x%08x 0x%08x 0x%08x 0x%08x\n",
           *((const u32 *)(foo+0)), *((const u32 *)(foo+1)),
           *((const u32 *)(foo+2)), *((const u32 *)(foo+3)));

Instead of outputting:

    0x00010203 0x01020304 0x02030405 0x03040506

It will output:

    0x00010203 0x00010203 0x00010203 0x00010203

Other embedded architectures have very similar problems. Some may provide an "unaligned data access" exception, but offer insufficient information to repair the damage and resume execution.

Cheers,
Kyle Moffett
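For comparison, a portable way to read the same values without ever issuing an unaligned load is sketched below. `read_be32()` and `read_u32()` are hypothetical helper names for this illustration. Kernel code would normally use the existing get_unaligned() and put_unaligned() helpers instead; this standalone version only assumes a hosted C99 environment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a big-endian 32-bit value from any address, one byte at a time. */
static uint32_t read_be32(const void *p)
{
    const unsigned char *b = p;
    assert(p != NULL);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Read a native-endian 32-bit value from any address via memcpy. */
static uint32_t read_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof(v));   /* the compiler picks a safe access sequence */
    return v;
}
```

With the `foo` array from the example above, `read_be32(foo + 1)` yields 0x01020304 whether or not the hardware tolerates unaligned loads, because only single bytes are ever fetched.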
Re: Futexes and network filesystems.
On Nov 20, 2007, at 17:53:52, Eric W. Biederman wrote:
> I had a chance to think about this a bit more, and realized that the problem is that futexes don't appear to work on network filesystems, even if the network filesystems provide coherent shared memory.
>
> It seems to me that we need to have a call that gets a unique token for a process per filesystem for use in futexes (especially robust futexes). Say get_fs_task_id(const char *path);
>
> On local filesystems this could just be the pid as we use today, but for filesystems that can be accessed from contexts with potentially overlapping pid values this could be something else. It is an extra syscall in the preparation path, but it should be hardly more expensive than the current getpid().
>
> Once we have fixed the futex infrastructure to be able to handle futexes on network filesystems, the pid namespace case will be trivial to implement.

Actually, I would think that get_vm_task_id(void *addr) would be a more useful interface. The call would still be a relatively simple lookup to find the struct file associated with the particular virtual mapping, but it would be race-free from the perspective of userspace and would not require that we somehow figure out the file descriptor associated with a particular mmap() (which may be closed by this point in time). Useful extensions would be get_fd_task_id(int fd) and get_fs_task_id(const char *path), but those are less important.

The other important thing is to ensure that somehow the numbers are considered unique only within the particular domain of a container, such that you can migrate a container from one system to another even using a simple local ext3 filesystem (on a networked block device) and still be able to have things work properly even after the migration. Naturally this would only work with an upgraded libc but I think that's a reasonable requirement to enforce for migration of futexes and cross-network futexes.

Even for network filesystems which don't implement coherent shared memory, you might add a memexcl() system call which (when used by multiple cooperating processes) ensures that a given page is only ever mapped by at most one computer accessing a given network filesystem. The page-outs and page-ins when shuttling that page across the network would be expensive, but I believe the cost would be reasonable for many applications and it would allow traditional atomic ops on the mapped pages to take and release futexes in the uncontended case.

Cheers,
Kyle Moffett
Re: High priority tasks break SMP balancer?
First of all, since Ingo Molnar seems to be one of the head scheduler gurus, you might CC him on this. Also added a couple other useful CCs for regression reports.

On Nov 09, 2007, at 19:11:03, Micah Dowty wrote:
> As I said, YMMV. I haven't been able to find a single set of parameters for the demo program which cause the problem to occur 100% of the time on all systems. In general, boosting the MAINTHREAD_PRIORITY even more and increasing the WAKE_HZ should exaggerate the problem. These parameters reproduce the problem very reliably on my system:
>
> #define NUM_BUSY_THREADS 2
> #define MAINTHREAD_PRIORITY -20
> #define MAINTHREAD_WAKE_HZ 1024
> #define MAINTHREAD_LOAD_PERCENT 5
> #define MAINTHREAD_LOAD_CYCLES 2

Well, from these statistics: if you are requesting wakeups that often then it is probably *not* correct to try to move another thread to that CPU in the meantime. Essentially the migration cost will likely far outweigh the advantage of letting it run a little bit of extra time, and in addition it will dump out cache from the high-priority thread. As per the description I think that an increased priority and increased WAKE_HZ will certainly cause the "problem" to occur more, simply because it reduces the time between wakeups of the high-priority process and makes it less helpful to migrate another process over to that CPU during the sleep periods. This will also depend on your hardware and possibly other configuration parameters.

I'm not really that much of an expert in this particular area, though, so it's entirely possible that one of the above-mentioned scheduler head-honchos will poke holes in my argument and give a better explanation or a possible patch.

Cheers,
Kyle Moffett
[PATCH] Fix isspace() and other ctype.h functions to ignore chars 128-255
Originally isspace() and other similar functions in ctype.h ignored any character with the high bit set; however this was changed during the Linux 2.1 days to map Latin-1. As following Latin-1 will most likely break UTF-8 and any *other* encoding that is backwards-compatible with 7-bit ASCII, change ctype.c to ignore such characters completely (the way they were before). Linus seems to think this is a good thing, and he's the one that wrote the code in the first place.

Signed-off-by: Kyle Moffett <[EMAIL PROTECTED]>
---
On Nov 06, 2007, at 10:53:08, Linus Torvalds wrote:
> On Tue, 6 Nov 2007, Kyle Moffett wrote:
>> Personally I think that isspace() accepting character 0xA0 is a bug
>
> I think I agree with you. As far as the kernel is concerned, "isspace()" should just accept the obvious spaces (hardspace, tab, newline), and *perhaps* the VT/FF kind of things.
>
> You should realize that the kernel <ctype.h> thing is *ancient*. It's basically there from v0.01, and while the really original one (I just checked) had all the non-ascii characters not trigger anything, it was converted to be latin1 in the 2.1.x timeframe.
>
> That's a *loong* time ago. Way before UTF-8 and other things were really common. So we should probably just make all the upper 128 bytes go back to "don't trigger anything in ctype.h" - they'd not be spaces, but they'd not be control characters or anything else either.

 lib/ctype.c | 17 +++--
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/ctype.c b/lib/ctype.c
index d02ace1..ce2807a 100644
--- a/lib/ctype.c
+++ b/lib/ctype.c
@@ -24,13 +24,18 @@ _P,_L|_X,_L|_X,_L|_X,_L|_X,_L|_X,_L|_X,_L,	/* 96-103 */
 _L,_L,_L,_L,_L,_L,_L,_L,	/* 104-111 */
 _L,_L,_L,_L,_L,_L,_L,_L,	/* 112-119 */
 _L,_L,_L,_P,_P,_P,_P,_C,	/* 120-127 */
+
+/*
+ * None of these match any type bits to avoid screwing up UTF-8 or any other
+ * 7-bit-ASCII-compatible encoding.
+ */
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,	/* 128-143 */
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,	/* 144-159 */
-_S|_SP,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,	/* 160-175 */
-_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,	/* 176-191 */
-_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,	/* 192-207 */
-_U,_U,_U,_U,_U,_U,_U,_P,_U,_U,_U,_U,_U,_U,_U,_L,	/* 208-223 */
-_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,	/* 224-239 */
-_L,_L,_L,_L,_L,_L,_L,_P,_L,_L,_L,_L,_L,_L,_L,_L};	/* 240-255 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,	/* 160-175 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,	/* 176-191 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,	/* 192-207 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,	/* 208-223 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,	/* 224-239 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};	/* 240-255 */
 
 EXPORT_SYMBOL(_ctype);
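To illustrate why the Latin-1 mapping is a problem for UTF-8 input, here is a minimal user-space sketch. It does not use the kernel's _ctype table; `ascii_isspace()` and `count_spaces()` are hypothetical names, and the classifier simply hard-codes the behaviour the patch aims for (only 7-bit ASCII whitespace matches, anything with the high bit set matches nothing).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the patched behaviour: space plus the
 * tab/newline/VT/FF/CR range (9-13) are whitespace, and every byte
 * with the high bit set is never classified as anything.
 */
static bool ascii_isspace(unsigned char c)
{
	return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Count the whitespace bytes in a buffer of raw (possibly UTF-8) bytes. */
static size_t count_spaces(const unsigned char *buf, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (ascii_isspace(buf[i]))
			n++;
	return n;
}
```

The UTF-8 encoding of "à" is the two bytes 0xC3 0xA0. With the old Latin-1 table the second byte carries _S|_SP and would be treated as a space in the middle of a character; with a classifier like the one above the sequence contains no whitespace at all.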
Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser
On Nov 06, 2007, at 07:23:36, Ahmed S. Darwish wrote:
> On 11/6/07, Adrian Bunk <[EMAIL PROTECTED]> wrote:
>> On Tue, Nov 06, 2007 at 01:34:05PM +0200, Ahmed S. Darwish wrote:
>>> As far as I understand the problem now, isspace() accepts the 0xa0 character which might collide with some of UTF-8 encoded characters cause the high bit is set.
>>>
>>> I admit I'm not experienced in such encoding stuff, but shouldn't the ASCII and the ASCII-compatible UTF-8 encodings be enough for the labels?
>>
>> It would not work if someone would e.g. give you UTF-16 encoded strings, but I don't see this happening in practice.
>
> Won't this complicate the code too much ?

Well, the VFS (for example) certainly doesn't support any encodings other than various extended-ASCII forms (which includes UTF-8). Something like UTF-16 has extra null characters in between every normal character, and as such would fail completely if passed to the VFS.

Personally I think that isspace() accepting character 0xA0 is a bug, as there are several variants of extended ASCII, only one of which has that character as a space. Others have it as á (accented A), etc. In addition, the "canonical" internal text format of the kernel is UTF-8, as that encoding can represent any character in any other encoding and it is backwards-compatible with traditional ASCII.

Cheers,
Kyle Moffett
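The UTF-16 point can be sketched in a few lines of user-space C. This is only an illustration with hypothetical names (`utf16le_ab`, `c_string_length()`); it shows what any NUL-terminated string interface sees when handed UTF-16LE bytes.

```c
#include <string.h>

/* "ab" encoded as UTF-16LE: every ASCII character is followed by a 0x00 byte. */
static const char utf16le_ab[] = { 'a', 0x00, 'b', 0x00, 0x00, 0x00 };

/* Length as seen by any interface that treats its input as a C string. */
static size_t c_string_length(const char *bytes)
{
	return strlen(bytes);
}
```

Here `c_string_length(utf16le_ab)` stops at the first zero byte and reports a one-character string, whereas the same text in ASCII or UTF-8 is seen in full.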
Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser
On Nov 06, 2007, at 01:33:05, Adrian Bunk wrote:
> Can you limit this to 7bit ASCII and use isascii() somewhere?
>
> Otherwise I'd expect funny things to happen when you e.g. use isspace() on the UTF-8 encoded character à.

Actually, you don't need to. You tell them it expects UTF-8 encoded strings and be done with it. All US-ASCII characters from 0 through 127 (i.e. high bit clear) are exactly the same in UTF-8, and every byte of a multi-byte UTF-8 character has the high bit set. Therefore you just assume that anything with the high bit set is part of a word and you can handle basic UTF-8. (It doesn't work on special UTF-8 space characters like nonbreaking space and similar, but handling those is significantly more complicated.)

Cheers,
Kyle Moffett
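A rough sketch of that rule, assuming whitespace-separated tokens in a NUL-terminated buffer. `count_words()` is a hypothetical helper and not the Smack rule parser; it only shows the "high bit set means part of a word" test.

```c
#include <stddef.h>

/*
 * Count whitespace-separated words in a NUL-terminated UTF-8 string.
 * Only 7-bit ASCII whitespace separates words; any byte >= 0x80 is
 * treated as part of a word, so multi-byte characters are never split.
 */
static size_t count_words(const char *s)
{
	size_t words = 0;
	int in_word = 0;

	for (; *s; s++) {
		unsigned char c = (unsigned char)*s;
		int is_space = c < 0x80 &&
			       (c == ' ' || (c >= '\t' && c <= '\r'));

		if (is_space) {
			in_word = 0;
		} else if (!in_word) {
			in_word = 1;
			words++;
		}
	}
	return words;
}
```

As noted above, this deliberately does not recognise U+00A0 or other non-ASCII space characters; they simply become part of the surrounding word.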
Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser
On Nov 03, 2007, at 12:43:06, Ahmed S. Darwish wrote:
> Bashv3 builtin "echo" behaves very strangely to -EINVAL. It sends all the buffers that causes -EINVAL again in subsequent echo invocations, i.e.
>
>	echo "Invalid Rule" > /smack/load	# -EINVAL returned
>	echo "Valid Rule" > /smack/load
>
> In second iteration, echo sends the first invalid buffer again then sends the new one. This causes a "Invalid Rule\nValid Rule" buffer sent to write().
>
> IMHO, this is a bug in builtin echo. The external /bin/echo doesn't cause such strange behaviour.

Actually, what causes problems here is something between a bug and a feature in libc's buffering. Basically the -EINVAL error causes libc to leave its data in the file-output buffer despite the file being closed and reopened. Since a standalone echo just exits, that buffer is discarded, but for the bash builtin it hangs around in the buffer for a while and ends up getting prepended to the following echo statement. There are actually multiple ways to make this fail; this is just the simplest.

Cheers,
Kyle Moffett
Re: Linux Security *Module* Framework (Was: LSM conversion to static interface)
On Oct 24, 2007, at 17:37:04, Serge E. Hallyn wrote:
> The scariest thing to consider is programs which don't appropriately handle failure. So I don't know, maybe the system runs a remote logger to which the multiadm policy gives some extra privs, but now the portac module prevents it from sending its data. And maybe, since the authors never saw this failure as possible, the program happens to dump sensitive data in a public readable place.
>
> I *could* be more vague but it'd be tough :) But you get the idea.

Well, there *was* that problem with sendmail where it did not properly check the result of setuid() and just assumed it had succeeded. So instead of running as "smtpd" it was running as "root". Not a happy memory.

Cheers,
Kyle Moffett
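For reference, the defensive pattern looks roughly like this in user space. It is a generic sketch, not sendmail's code; `drop_to_uid()` is a hypothetical helper name.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Switch to the given UID and verify that the switch really happened.
 * Returns 0 only if both the real and effective UID match afterwards;
 * on -1 the caller should abort rather than carry on with its old
 * privileges.
 */
static int drop_to_uid(uid_t uid)
{
	if (setuid(uid) != 0)
		return -1;
	if (getuid() != uid || geteuid() != uid)
		return -1;
	return 0;
}
```

The important part is that a failure from setuid() is treated as fatal by the caller; a security module that starts denying the call turns an unchecked setuid() into "keep running as root".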
Re: [PATCH 1/4] stringbuf: A string buffer implementation
On Oct 24, 2007, at 17:21:10, Matthew Wilcox wrote:
> On Wed, Oct 24, 2007 at 04:59:48PM -0400, Kyle Moffett wrote:
>> This seems unlikely to work reliably as the various "v*printf" functions modify the va_list argument they are passed. It may happen to work on your particular architecture depending on how that argument data is passed and stored, but you probably actually want to make a copy of the varargs list for the first vsnprintf call.
>
> I based what I did on how printk works:
>
>	va_start(args, fmt);
>	r = vprintk(fmt, args);
>	va_end(args);
>
> It doesn't call va_* anywhere else. I don't claim to be a varargs expert, but if I'm wrong, I'm at least wrong the same way that printk is, so not in any way that's significant for any other architecture Linux runs on.

No, the problem is what happens when you don't have enough space allocated: you call "vsnprintf(s, len, format, args);" and then later call "vsprintf(s, format, args);" with the *SAME* "args". That's what's broken.

So this is wrong:

	va_list args;
	va_start(args, fmt);
	r1 = vprintk(fmt, args);
	r2 = vprintk(fmt, args);
	va_end(args);

To fix it, you have 2 options.

Option 1:

	va_list args;
	va_start(args, fmt);
	r1 = vprintk(fmt, args);
	va_end(args);
	va_start(args, fmt);
	r2 = vprintk(fmt, args);
	va_end(args);

Option 2:

	va_list args, argscopy;
	va_start(args, fmt);
	va_copy(argscopy, args);
	r1 = vprintk(fmt, argscopy);
	va_end(argscopy);
	r2 = vprintk(fmt, args);
	va_end(args);

Now in a function which *receives* a va_list from one of its callers, "Option 1" isn't an option because you don't have the original stack frame, so the result looks like this:

	void func1(const char *fmt, ...)
	{
		va_list ap;
		va_start(ap, fmt);
		func2(fmt, ap);
		va_end(ap);
	}

	void func2(const char *fmt, va_list ap)
	{
		va_list ap2;
		va_copy(ap2, ap);
		vprintk(fmt, ap2);
		va_end(ap2);
		vprintk(fmt, ap);
	}

Cheers,
Kyle Moffett
Re: [PATCH 1/4] stringbuf: A string buffer implementation
On Oct 24, 2007, at 15:59:49, Matthew Wilcox wrote:
> +static void sb_vprintf(struct stringbuf *sb, gfp_t gfp, const char *format, va_list args)
> +{
> [...]
> +	s = sb->buf + sb->len;
> +	size = vsnprintf(s, sb->alloc - sb->len, format, args);
> [...]
> +	/* Point to the end of the old string since we already updated ->len */
> +	s += sb->len - size;
> +	vsprintf(s, format, args);
> [...]
> +void sb_printf(struct stringbuf *sb, gfp_t gfp, const char *format, ...)
> +{
> +	va_list args;
> +
> +	va_start(args, format);
> +	sb_vprintf(sb, gfp, format, args);
> +	va_end(args);
> +}

This seems unlikely to work reliably as the various "v*printf" functions modify the va_list argument they are passed. It may happen to work on your particular architecture depending on how that argument data is passed and stored, but you probably actually want to make a copy of the varargs list for the first vsnprintf call. Example below:

	va_list argscopy;
	va_copy(argscopy, args);
	[...]
	size = vsnprintf(s, sb->alloc - sb->len, format, argscopy);
	[...]
	s += sb->len - size;
	vsprintf(s, format, args);
	[...]
	va_end(argscopy);

Cheers,
Kyle Moffett
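For anyone who wants to try the pattern outside the kernel, here is a self-contained user-space sketch. It is an assumption-laden analogue, not the proposed stringbuf code: `struct strbuf`, `strbuf_vprintf()` and `strbuf_printf()` are hypothetical names, and realloc() stands in for the kernel allocator. The first vsnprintf() pass works on a copy of the argument list, so the original is still untouched if a second pass is needed after growing the buffer.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical user-space analogue of the stringbuf under discussion. */
struct strbuf {
	char *buf;
	size_t len;	/* bytes used, excluding the terminating NUL */
	size_t alloc;	/* bytes allocated */
};

static int strbuf_vprintf(struct strbuf *sb, const char *fmt, va_list args)
{
	va_list copy;
	int size;

	/* First pass: try to format in place, consuming only the copy. */
	va_copy(copy, args);
	size = vsnprintf(sb->buf ? sb->buf + sb->len : NULL,
			 sb->alloc - sb->len, fmt, copy);
	va_end(copy);
	if (size < 0)
		return -1;

	/* Output was truncated: grow the buffer and format again. */
	if (sb->len + (size_t)size + 1 > sb->alloc) {
		size_t want = sb->len + (size_t)size + 1;
		char *p = realloc(sb->buf, want);

		if (!p)
			return -1;
		sb->buf = p;
		sb->alloc = want;
		/* Second pass uses the original, still-unconsumed list. */
		vsnprintf(sb->buf + sb->len, sb->alloc - sb->len, fmt, args);
	}
	sb->len += (size_t)size;
	return 0;
}

static int strbuf_printf(struct strbuf *sb, const char *fmt, ...)
{
	va_list args;
	int r;

	va_start(args, fmt);
	r = strbuf_vprintf(sb, fmt, args);
	va_end(args);
	return r;
}
```

Starting from a zero-initialised `struct strbuf`, repeated strbuf_printf() calls append to the same string; the caller frees `buf` when done.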
Re: [PATCH] Reserve N process to root
On Oct 12, 2007, at 01:37:23, Al Boldi wrote:
> Kyle Moffett wrote:
>> This isn't really necessary any more with the new CFS scheduler. If you want to prevent excess memory usage then you limit memory usage, not process count, so just set the system max process count to something absurdly high and leave the user counts down at the maximum a user might run. Then as long as the sum of the user processes is less than the max number of processes (which you just set absurdly high or unlimited), you may still log in.
>>
>> With the per-user scheduling enabled CFS allows you to run an optimistically-real-time game as one user and several thousand busy-loops as another user and get almost picture perfect 50% CPU distribution between the users. To me that seems a much better DoS-prevention system than limits which don't scale based on how many people are requesting resources.
>
> You have a point, and resource-controllers can probably control DoS a lot better, but they also incur more overhead. Think of this "lockout prevention" patch as a near zero overhead safety valve.

But why do you need to add "lockout prevention" if it already exists? With CFS' extremely efficient per-user scheduling (hopefully soon to be the default) there are only two forms of lockout by non-root processes:

  (1) Running out of PIDs in the box's PID-space (think tens or hundreds of thousands of processes), or
  (2) Swap-storming the box to death.

To put it bluntly, trying to reserve free PID slots is attacking the wrong end of the problem, and your so-called "lockout prevention" could very easily ensure that 10 PIDs are available even if the user has swapstormed the box with the PIDs he does have.

Cheers,
Kyle Moffett
Re: [PATCH] Reserve N process to root
Please don't trim CC lists.

On Oct 11, 2007, at 17:02:37, Al Boldi wrote:
> David Newall wrote:
>> [EMAIL PROTECTED] wrote:
>>> What David meant was that "root will always have a slot" doesn't *actually* help unless you *also* have a way to actually *spawn* such a process. In order to do the ps, kill, and so on that you need to recover, you need to already have either a root shell available, or a way to *get* a root shell that doesn't rely on a non-root process (so /bin/su doesn't help here).
>>
>> That's right, although it's worse than that. You need to have a process with CAP_SYS_ADMIN. If root processes normally have that capability then the reserved slots may well disappear before you notice a problem. If root processes normally don't have it, then you need to guarantee that one is already running.
>
> I once posted a patch to handle this DoS, but, as usual, it wasn't accepted. Go figure...

This isn't really necessary any more with the new CFS scheduler. If you want to prevent excess memory usage then you limit memory usage, not process count, so just set the system max process count to something absurdly high and leave the user counts down at the maximum a user might run. Then as long as the sum of the user processes is less than the max number of processes (which you just set absurdly high or unlimited), you may still log in.

With the per-user scheduling enabled, CFS allows you to run an optimistically-real-time game as one user and several thousand busy-loops as another user and get almost picture perfect 50% CPU distribution between the users. To me that seems a much better DoS-prevention system than limits which don't scale based on how many people are requesting resources.

Cheers,
Kyle Moffett
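As a concrete illustration of "leave the user counts down", one existing per-user knob is the RLIMIT_NPROC resource limit. That this is the mechanism meant above is an assumption on my part, and `set_user_process_limit()` is a hypothetical helper; a login service would typically apply something like this before starting the user's session.

```c
#include <sys/resource.h>

/*
 * Set the soft per-user process limit to max_procs, clamped to the
 * hard limit so that an unprivileged caller does not get EPERM.
 * The limit is inherited by child processes. Returns 0 on success.
 */
static int set_user_process_limit(rlim_t max_procs)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NPROC, &rl) != 0)
		return -1;
	if (rl.rlim_max != RLIM_INFINITY && max_procs > rl.rlim_max)
		max_procs = rl.rlim_max;
	rl.rlim_cur = max_procs;
	return setrlimit(RLIMIT_NPROC, &rl);
}
```

This only bounds how many processes one user may own; it says nothing about memory, which is why the swap-storm case still needs a memory limit.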
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On Oct 11, 2007, at 11:41:34, Casey Schaufler wrote:
> --- Kyle Moffett <[EMAIL PROTECTED]> wrote:
>> [snipped]
>
> I'm still waiting to see the proposed SELinux policy that does what Smack does.

That *is* the SELinux policy which does what Smack does. I keep having bugs in the perl-script I'm writing on account of not having the time to really get around to fixing it, but that is exactly the procedure for generating an SELinux policy from a SMACK policy.

> I can accept that you don't see anything that can't be implemented thus, but that's not the point. You've provided some really clear design notes, and that's great, but it ain't the code. You said that you could write a 500 line perl script that would do the whole thing, and that left some people with an impression that Smack is a subset of SELinux. Well, I'm already finding myself digging out from under that misunderstanding, and with people who are assuming that your policy has been done, "proving" the point.

I'd love to have time to finish the script but unfortunately real life keeps interfering and I'm going to have to go back to lurking on this thread.

Cheers,
Kyle Moffett
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
Ok, finally getting some time to work on this stuff once again (life gets really crazy sometimes).

I would like to postulate that you can restate any SMACK policy as a functionally equivalent SELinux policy (with a few slight technical differences, see below). I've been working on a script to do this but keep getting stuck tracking down minor bugs and then get dragged off on other things I need to do. Here is the method I am presently trying to implement:

First divide the SELinux access vectors into 7 groups based on which ones SMACK wishes to influence:

  (R) Requires "read" permissions (the 'r' bit)
  (W) Requires "write" permissions (the 'w' bit)
  (X) Requires "execute" permissions (the 'x' bit)
  (A) Requires "append" OR "write" permissions (the 'a' bit)
  (P) Requires CAP_MAC_OVERRIDE
  (K) May not be performed by a non-CAP_MAC_OVERRIDE process on a CAP_MAC_OVERRIDE process
  (N) Does not require any special permissions

The letters in front indicate the names I will use in the rest of this document to describe the sets of access vectors.

Next define a single SELinux user "smack", and two independent roles, "priv" and "nopriv". We create the set of SMACK equivalence-classes defined as various SELinux types with substitutions for "*", "^", "_", and "?", and then completely omit the MLS portions of the SELinux policy.

The next step is to establish the fundamental constraints of the policy. To prevent processes from gaining CAP_MAC_OVERRIDE we iterate over the access vectors in (K) and add the following constraint for each vector:

  constrain $OBJECT_CLASS $ACCESS_VECTOR ((r1 == r2) || (r1 == priv))

This also includes:

  constrain process transition ((r1 == r2) || (r1 == priv))

Then we require privilege to access the (P) vectors; for each vector in (P) we add a constraint:

  constrain $OBJECT_CLASS $ACCESS_VECTOR (r1 == priv)

At this point the only rules left to add are the between-type rules. Here it gets mildly complicated because SMACK is a linear-lookup system (each rule must be matched in order) whereas SELinux is a globally-unique-lookup system (all rules are mutually exclusive and matched simultaneously). Essentially for each SMACK rule:

  $SOURCE $DEST $PERM_BITS

We iterate over all of the classes represented in the access vector lists in $PERM_BITS and create rules for each one:

  allow { $SOURCE } { $DEST }:$PERM_CLASS { $PERM_VECTORS };

If you need SMACK to allow subtractive permissions then you need to expand that further, however I believe as an initial cut that is sufficient.

The only other task is to prepend the auto-generated object-class and access-vector lists to the policy and append the initial SIDs that smack wants various objects to have, as well as allowing the "smack" user the "priv" and "nopriv" roles and allowing those two roles entry into all of the SMACK types.

The resulting SELinux-ified SMACK labels would go from:

  SomeLabel (with CAP_MAC_OVERRIDE)
  AnotherLabel
  YetAnotherLabel

to:

  smack:priv:SomeLabel
  smack:nopriv:AnotherLabel
  smack:nopriv:YetAnotherLabel

Casey, hopefully this gives you some ideas about how I think you could modify the SELinux code to compile out the "user" field and simplify the "role" field as needed. I'm still not seeing anything which SELinux cannot directly implement without additional code, even the "CAP_MAC_OVERRIDE" bit. If the semantics don't seem quite right, please provide details about how you think the models differ and I will try to address the concerns.

Cheers,
Kyle Moffett
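The between-type step described above can be sketched in C. This is only an illustration of the "$SOURCE $DEST $PERM_BITS" to "allow" translation, not the perl script mentioned in the thread: the table smack_perm_map, the function smack_rule_to_allow, and the choice of the "file" class with these particular vectors are all assumptions made for the example, and a real table would have to cover every object class.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mapping from one SMACK permission bit to an SELinux
 * object class and access vectors; not taken from any real policy. */
struct perm_map {
	char bit;
	const char *obj_class;
	const char *vectors;
};

static const struct perm_map smack_perm_map[] = {
	{ 'r', "file", "read" },
	{ 'w', "file", "write append" },	/* 'w' also satisfies group (A) */
	{ 'x', "file", "execute" },
	{ 'a', "file", "append" },
};

/* Write one "allow" line per permission bit set in perm_bits.
 * Returns the number of rules written, or -1 if out is too small. */
int smack_rule_to_allow(const char *source, const char *dest,
			const char *perm_bits, char *out, size_t outlen)
{
	size_t pos = 0;
	int rules = 0;

	assert(source && dest && perm_bits && out && outlen > 0);
	out[0] = '\0';
	for (size_t i = 0; i < sizeof(smack_perm_map) / sizeof(smack_perm_map[0]); i++) {
		const struct perm_map *m = &smack_perm_map[i];
		int n;

		if (!strchr(perm_bits, m->bit))
			continue;
		n = snprintf(out + pos, outlen - pos,
			     "allow { %s } { %s }:%s { %s };\n",
			     source, dest, m->obj_class, m->vectors);
		if (n < 0 || (size_t)n >= outlen - pos)
			return -1;
		pos += (size_t)n;
		rules++;
	}
	return rules;
}
```

Called with source "SomeLabel", destination "AnotherLabel" and permission bits "rx", the sketch emits two allow lines, one for "read" and one for "execute". Subtractive permissions and the role constraints are deliberately left out.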
Re: "mount --bind" with user/group/mode definition?
On Oct 11, 2007, at 04:35:37, Ph. Marek wrote:
> is there some way to duplicate a directory somewhere else (like with "mount --bind"), but having different owner/group/mode bits?
>
> I'd like to mount a directory I have no control over (think NFS, or floppy, ...) with clearly defined rights - like root:<some group>, mode 0550 for all directories, and 0440 for all files. (Here I want to have full *read* control, regardless of the original permissions).
>
> [ I know that this special case can be (mostly) done by a read-only binding mount; the part that is missing is eg. files with a different owner being 0700. ]
>
> I know that something like this is possible for eg. VFAT, which has no right descriptors for itself; but I'd need that for arbitrary directory trees, who themselves *have* permissions set.
>
> Is there some way to achieve that?

Not at the moment, unfortunately. I suspect that with the recent developments in user container support and/or overlay mounting it will become possible to either write a UID/GID-translation overlay filesystem or grant cross-UID-container keys to achieve what you want. On the other hand that probably won't fully happen for up to a year or so.

Cheers,
Kyle Moffett
Re: idio{,ma}tic typos (was Re: + fix-vm_can_nonlinear-check-in-sys_remap_file_pages.patch added to -mm tree)
On Oct 11, 2007, at 03:35:37, Alexey Dobriyan wrote:
> Sadly, yes.
>
> [PATCH] smctr: fix "|| 0x" typo
>
> IBM_PASS_SOURCE_ADDR is 1, so logically ORing it with status bits is pretty useless. Do bitwise OR, instead.
>
> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]>
> ---
>  drivers/net/tokenring/smctr.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/drivers/net/tokenring/smctr.c
> +++ b/drivers/net/tokenring/smctr.c
> @@ -3413,7 +3413,7 @@ static int smctr_make_tx_status_code(struct net_device *dev,
>  	tsv->svi = TRANSMIT_STATUS_CODE;
>  	tsv->svl = S_TRANSMIT_STATUS_CODE;
> -	tsv->svv[0] = ((tx_fstatus & 0x0100 >> 6) || IBM_PASS_SOURCE_ADDR);
> +	tsv->svv[0] = ((tx_fstatus & 0x0100 >> 6) | IBM_PASS_SOURCE_ADDR);
>  	/* Stripped frame status of Transmitted Frame */
>  	tsv->svv[1] = tx_fstatus & 0xff;

Hmm, here's a question for you: The old code was equivalent to "tsv->svv[0] = 1;", what's your proof that we don't rely on this "bug" elsewhere in the code? In other words, this is a significant behavior change (albeit fixing an apparent bug) from what we've done for a while. You might want to do a git-blame on this bit of code to see who the last person to modify it was and ask them to test or confirm the patch first.

The same general questions apply to the other logical-op bugs.

Cheers,
Kyle Moffett
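To make the behavior change concrete, here is a small C sketch of the two expressions from the patch, lifted out of the driver. The function names svv0_old, svv0_patched and svv0_self_check are made up for the example; IBM_PASS_SOURCE_ADDR is taken to be 1, as the patch description states. One extra observation that the thread does not discuss: in C, ">>" binds tighter than "&", so "tx_fstatus & 0x0100 >> 6" parses as "tx_fstatus & (0x0100 >> 6)", that is "tx_fstatus & 0x4". Whether that was the intent is not something this sketch can answer.

```c
#include <assert.h>
#include <stdint.h>

/* Value stated in the patch description. */
#define IBM_PASS_SOURCE_ADDR 1

/* Old expression: a logical OR with a nonzero constant is always 1. */
uint8_t svv0_old(uint16_t tx_fstatus)
{
	return (tx_fstatus & 0x0100 >> 6) || IBM_PASS_SOURCE_ADDR;
}

/* Patched expression: the bitwise OR keeps whatever the mask selected.
 * Note the mask is (0x0100 >> 6) == 0x4 because of operator precedence. */
uint8_t svv0_patched(uint16_t tx_fstatus)
{
	return (tx_fstatus & 0x0100 >> 6) | IBM_PASS_SOURCE_ADDR;
}

/* Spot checks for the two variants. */
void svv0_self_check(void)
{
	assert(svv0_old(0x0000) == 1);		/* always 1 ... */
	assert(svv0_old(0x0104) == 1);		/* ... whatever the status */
	assert(svv0_patched(0x0004) == 5);	/* bit 2 survives, plus bit 0 */
	assert(svv0_patched(0x0100) == 1);	/* bit 8 is never looked at */
}
```

So any caller that has only ever seen svv[0] == 1 may start seeing 5 after the patch, which is the kind of reliance the reply asks to have checked.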
Re: [PATCH] Reserve N process to root
Please don't trim CC lists

On Oct 11, 2007, at 17:02:37, Al Boldi wrote:
> David Newall wrote:
>> [EMAIL PROTECTED] wrote:
>>> What David meant was that "root will always have a slot" doesn't *actually* help unless you *also* have a way to actually *spawn* such a process. In order to do the ps, kill, and so on that you need to recover, you need to already have either a root shell available, or a way to *get* a root shell that doesn't rely on a non-root process (so /bin/su doesn't help here).
>>
>> That's right, although it's worse than that. You need to have a process with CAP_SYS_ADMIN. If root processes normally have that capability then the reserved slots may well disappear before you notice a problem. If root processes normally don't have it, then you need to guarantee that one is already running.
>
> I once posted a patch to handle this DoS, but, as usual, it wasn't accepted. Go figure...

This isn't really necessary any more with the new CFS scheduler. If you want to prevent excess memory usage then you limit memory usage, not process count, so just set the system max process count to something absurdly high and leave the user counts down at the maximum a user might run. Then as long as the sum of the user processes is less than the max number of processes (which you just set absurdly high or unlimited), you may still log in.

With the per-user scheduling enabled CFS allows you to run an optimistically-real-time game as one user and several thousand busy-loops as another user and get almost picture perfect 50% CPU distribution between the users. To me that seems a much better DoS-prevention system than limits which don't scale based on how many people are requesting resources.

Cheers,
Kyle Moffett
Re: [PATCH] Reserve N process to root
On Oct 12, 2007, at 01:37:23, Al Boldi wrote:
> Kyle Moffett wrote:
>> This isn't really necessary any more with the new CFS scheduler. If you want to prevent excess memory usage then you limit memory usage, not process count, so just set the system max process count to something absurdly high and leave the user counts down at the maximum a user might run. Then as long as the sum of the user processes is less than the max number of processes (which you just set absurdly high or unlimited), you may still log in.
>>
>> With the per-user scheduling enabled CFS allows you to run an optimistically-real-time game as one user and several thousand busy-loops as another user and get almost picture perfect 50% CPU distribution between the users. To me that seems a much better DoS-prevention system than limits which don't scale based on how many people are requesting resources.
>
> You have a point, and resource-controllers can probably control DoS a lot better, but they also incur more overhead. Think of this lockout prevention patch as a near zero overhead safety valve.

But why do you need to add "lockout prevention" if it already exists? With CFS' extremely efficient per-user-scheduling (hopefully soon to be the default) there are only two forms of lockout by non-root processes:

  (1) Running out of PIDs in the box's PID-space (think tens or hundreds of thousands of processes), or
  (2) Swap-storming the box to death.

To put it bluntly, trying to reserve free PID slots is attacking the wrong end of the problem, and your so-called "lockout prevention" could very easily ensure that 10 PIDs are available even if the user has swapstormed the box with the PIDs he does have.

Cheers,
Kyle Moffett
Re: [PATCH] Replace __attribute_pure__ with __pure
Trimmed the CC list a bit

On Oct 05, 2007, at 20:51:21, H. Peter Anvin wrote:
> Ralf Baechle wrote:
>> To be consistent with the use of attributes in the rest of the kernel replace all use of __attribute_pure__ with __pure and delete the definition of __attribute_pure__.
>
> Concern: __attribute_pure__ is very similar to __attribute_const__, which is almost completely, but not totally unlike the keyword "const"...

Yes, there's also the fact that __pure is a reserved GCC keyword. Essentially according to GCC docs all of the GCC-specific keywords are equivalently defined as "keyword", "__keyword", and "__keyword__", with only the latter two defined in strict-ANSI mode. The following is valid according to GCC docs:

  static int __attribute__((__pure)) my_strlen(const char *str);

With the proposed definition of __pure, that becomes a noticeably invalid:

  __attribute__((__attribute__((__pure__))))

Cheers,
Kyle Moffett
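The nesting problem can be reproduced with the preprocessor alone. The sketch below is an assumption-laden illustration: DEMO_PURE is a hypothetical stand-in for the proposed __pure macro (renamed so it cannot clash with anything a system header defines), and STRINGIFY, nested_expansion, count_occurrences and demo_len exist only for the example. It stringifies the expansion instead of compiling it, so it shows what the preprocessor produces without relying on a compiler error. Whether GCC really accepts the "__pure" spelling inside __attribute__((...)) is the mail's claim and is not checked here; the spellings used below, "pure" with surrounding double underscores, are the documented ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the proposed kernel macro. */
#define DEMO_PURE __attribute__((__pure__))

#define STRINGIFY_1(x) #x
#define STRINGIFY(x) STRINGIFY_1(x)	/* expands x before stringifying */

/* Intended use: the macro replaces the whole attribute. */
static size_t DEMO_PURE my_strlen(const char *str)
{
	size_t n = 0;

	while (str[n])
		n++;
	return n;
}

size_t demo_len(void)
{
	return my_strlen("pure");
}

/* Unintended use: the macro name written inside an attribute list.
 * Returns the text the preprocessor generates for that spelling. */
const char *nested_expansion(void)
{
	return STRINGIFY(__attribute__((DEMO_PURE)));
}

/* Count non-overlapping occurrences of needle in hay. */
size_t count_occurrences(const char *hay, const char *needle)
{
	size_t n = 0, step = strlen(needle);

	assert(step > 0);
	for (const char *p = strstr(hay, needle); p; p = strstr(p + step, needle))
		n++;
	return n;
}
```

The string returned by nested_expansion() contains "__attribute__" twice, which is the doubly wrapped form quoted in the message.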
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
On Oct 05, 2007, at 00:45:17, Eric W. Biederman wrote:
> Kyle Moffett <[EMAIL PROTECTED]> writes:
>> On Oct 04, 2007, at 21:44:02, Eric W. Biederman wrote:
>>> SELinux is not all encompassing or it is generally incomprehensible I don't know which. Or someone long ago would have said a better way to implement containers was with a selinux ruleset, here is a selinux ruleset that does that. Although it is completely possible to implement all of the isolation with the existing LSM hooks as Serge showed.
>>
>> The difference between SELinux and containers is that SELinux (and LSM as a whole) returns -EPERM to operations outside the scope of the subject, whereas containers return -ENOENT (because it's not even in the same namespace).
>
> Yes. However if you look at what the first implementations were. Especially something like linux-vserver. All they provided was isolation. So perhaps you would not see every process ps but they all had unique pid values. I'm pretty certain Serge at least prototyped a simplified version of that using the LSM hooks. Is there something I'm not remembering in those hooks that allows hiding of information like processes?
>
> Yes. Currently with containers we are taking that one step farther as that solves a wider set of problems.

IMHO, containers have a subtly different purpose from LSM even though both are about information hiding. Basically a container is information hiding primarily for administrative reasons; either as a convenience to help prevent errors or as a way of describing administrative boundaries. For example, even in an environment where all sysadmins are trusted employees, a few head-honcho sysadmins would get root container access, and all others would get access to specific containers as a way of preventing "oops" errors. Basically a container is about "full access inside this box and no access outside".

By contrast, LSM is more strictly about providing *limited* access to resources. For an accounting business all client records would be grouped and associated together, however those which have passed this year's review are read-only except by specific staff and others may have information restricted to some subset of the employees.

So containers are exclusive subsets of "the system" while LSM should be about non-exclusive information restriction.

>>> We also have in the kernel another parallel security mechanism (for what is generally a different class of operations) that has been quite successful, and different groups get along quite well, and ordinary mortals can understand it. The linux firewalling code.
>>
>> Well, I wouldn't go so far as the "ordinary mortals can understand it" part; it's still pretty high on the obtuse-o-meter.
>
> True. Probably a more accurate statement is: unix command line power users can and do handle it after reading the docs. That's not quite ordinary mortals but it feels like it some days. It might all be perception...

I have seen more *wrong* iptables firewalls than I've seen correct ones. Securing TCP/IP traffic properly requires either a lot of training/experience or a good out-of-the-box system like Shorewall which structures the necessary restrictions for you based on an abstract description of the desired functionality. For instance what percentage of admins do you think could correctly set up their netfilter firewalls to log christmas-tree packets, smurfs, etc without the help of some external tool? Hell, I don't trust myself to reliably do it without a lot of reading of docs and testing, and I've been doing netfilter firewalls for a while.

The bottom line is that with iptables it is *CRITICAL* to have a good set of interface tools to take the users' "My system is set up like..." description in some form and turn it into the necessary set of efficient security rules. The *exact* same issue applies to SELinux, with 2 major additional problems:

  1) Half the tools are still somewhat beta-ish and under heavy development. Furthermore the semi-official reference policy is nowhere near comprehensive and pretty ugly to read (go back to the point about the tools being beta-ish).

  2) If you break your system description or translation tools then instead of just your network dying your entire *system* dies.

>>> The linux firewalling code has hooks all throughout the networking stack, just like the LSM has hooks all throughout the rest of linux kernel. There is a difference however. The linux firewalling code in addition to hooks has tables behind those hooks that it consults. There is generic code to walk those tables and consult with different kernel modules to decide if we should drop a packet. Each of those kernel modules provides a different capability that can be used to generate a firewall.
>>
>> This is almost *EXACTLY* what SELinux provides as an LSM module. The one difference is that with SELinux [...]
Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel
[...] fair amount of what we need is already done in SELinux, and efforts would be better spent in figuring out what seems too complicated in SELinux and making it simpler. Probably a fair amount of that just means better tools.

Cheers,
Kyle Moffett
Re: [RFC] New kernel-message logging API (take 2)
On Sep 28, 2007, at 03:31:11, Geert Uytterhoeven wrote:
> Can't you store the loglevel in the kprint_block and check it in all successive kprint_*() macros? If gcc knows it's constant, it can optimize the non-wanted code away. As other fields in struct kprint_block cannot be constant (they store internal state), you have to split it like:
>
>   struct kprint_block {
>           int loglevel;
>           struct real_kprint_block real; /* internal state */
>   }
>
> and pass (&block.real) instead of &block to all successive internal functions. I haven't tried this, so let's hope gcc is actually smart enough...

Well actually, I believe you could just do:

  struct kprint_block {
          const int loglevel;
          [...];
  };

Then cast away the constness to actually set it initially:

  *((int *)&block.loglevel) = LOGLEVEL;

Cheers,
Kyle Moffett
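A variation on the const-member idea, sketched under stated assumptions: in ISO C, writing to an object defined with a const-qualified type through a cast is undefined behavior, so the sketch below sets the member with a designated initializer instead of the cast. That substitution is mine, not something proposed in the thread. The field "used", the macro KPRINT_BLOCK_INIT and the functions kprint_enabled and kprint_demo are hypothetical names; only struct kprint_block and its loglevel member come from the messages above.

```c
#include <assert.h>

struct kprint_block {
	const int loglevel;	/* fixed for the lifetime of the block */
	int used;		/* hypothetical stand-in for internal state */
};

/* Set the const member at definition time instead of casting later. */
#define KPRINT_BLOCK_INIT(level) { .loglevel = (level), .used = 0 }

/* Hypothetical check: lower numbers mean more important messages. */
static inline int kprint_enabled(const struct kprint_block *b, int threshold)
{
	return b->loglevel <= threshold;
}

int kprint_demo(void)
{
	struct kprint_block block = KPRINT_BLOCK_INIT(3);

	assert(block.loglevel == 3);
	block.used++;			/* non-const state stays writable */
	return kprint_enabled(&block, 4);
}
```

Because loglevel is const and initialized from a constant, a compiler is free to fold the comparison in kprint_enabled() when the call is inlined; whether a given gcc version actually does so would need to be checked.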
Re: [PATCHSET 4/4] sysfs: implement new features
On Sep 25, 2007, at 18:50:05, Greg KH wrote:
> On Thu, Sep 20, 2007 at 05:31:37PM +0900, Tejun Heo wrote:
>> * Name-formatting for symlinks. e.g. symlink pointing to
>>   /dira/dirb/leaf can be named as "symlink:%1-%0" and it will show
>>   up as "symlink:dirb-leaf". This only applies when new interface
>>   is used.
>
> Is this really necessary? It looks like we are adding a "special"
> type of parser here that no one uses.

IMHO this would be nicer if it could reuse existing sprintf code to
handle all the nice shiny sprintf format specifiers. The only challenge
would be how to dynamically build a varargs list from an array of
component names, although perhaps there could be an internal
__csprintf function which took a callback for retrieving arguments.
Also, since all of the path components are strings, I don't know that
numeric specifiers could be made useful, so perhaps it's not the
greatest idea.

I think the primary importance for this functionality is:

>> * Autorenaming of symlinks according to the name format string when
>>   target or one of its ancestors is renamed or moved. This only
>>   applies when new interface is used.
>
> Nice.

Cheers,
Kyle Moffett
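To make the quoted name-format example concrete, here is a rough userspace sketch of how "symlink:%1-%0" could expand against /dira/dirb/leaf. The semantics are inferred from that single example only (%0 is the leaf, %1 its parent, and so on), and sketch_format_symlink_name() is a hypothetical name, not the patchset's interface. Only single-digit indices are handled.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_COMPONENTS 32

/*
 * Expand "%N" in fmt to the N-th path component counted from the end
 * of path (0 = leaf). Returns 0 on success, -1 on any error.
 */
int sketch_format_symlink_name(const char *fmt, const char *path,
                               char *out, size_t outlen)
{
    char buf[256];
    const char *comp[SKETCH_MAX_COMPONENTS];
    size_t ncomp = 0, used = 0;
    char *tok;

    if (outlen == 0 || strlen(path) >= sizeof(buf))
        return -1;
    strcpy(buf, path);

    /* Split the path into components; strtok() skips empty ones. */
    for (tok = strtok(buf, "/"); tok; tok = strtok(NULL, "/")) {
        if (ncomp == SKETCH_MAX_COMPONENTS)
            return -1;
        comp[ncomp++] = tok;
    }

    for (; *fmt; fmt++) {
        const char *piece = fmt; /* default: copy one literal character */
        size_t len = 1;

        if (fmt[0] == '%' && fmt[1] >= '0' && fmt[1] <= '9') {
            size_t idx = (size_t)(fmt[1] - '0');

            if (idx >= ncomp)
                return -1;
            piece = comp[ncomp - 1 - idx];
            len = strlen(piece);
            fmt++; /* consume the digit as well */
        }
        if (used + len >= outlen)
            return -1;
        memcpy(out + used, piece, len);
        used += len;
    }
    out[used] = '\0';
    return 0;
}
```

Because every argument is a string picked by position, this hand-rolled expansion stays small, which is roughly the trade-off against reusing sprintf that the reply weighs.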
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
On Sep 27, 2007, at 17:34:45, Greg KH wrote:
> On Thu, Sep 27, 2007 at 02:37:42PM -0400, Theodore Tso wrote:
>> That fact that sysfs is all laid out in a directory, but for which
>> some directories/symlinks are OK to use, and some are NOT OK to
>> use --- is why I call the sysfs interface "an open pit".
>
> And because of the original design mistakes, we have only been able
> to change things for the better in a slow manner. We have had
> userspace programs fixed up for _years_ before we are able to make
> the corresponding changes in the kernel, so as to not break the
> distros that are slow to upgrade packages and kernels (like Debian.)

Hey! No poking fingers at Debian here; it's been *MUCH* improved
lately. I far more frequently have problems with boxes still running
some ancient release of RHEL-4 or something than I do with those
running Debian stable (virtually always the latest Debian stable).

Cheers,
Kyle Moffett
Re: [PATCH 10/25] Unionfs: add un/likely conditionals on copyup ops
On Sep 26, 2007, at 09:40:20, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, "Kok, Auke" writes:
>> I've been told several times that adding these is almost always
>> bogus - either it messes up the CPU branch prediction or the
>> compiler/CPU just does a lot better at finding the right way
>> without these hints. Adding them as a blanket seems rather strange.
>> Have you got any numbers that this really improves performance?
>
> Auke, that's a good question, but I found it hard to find any info
> about it. There's no discussion on it in Documentation/, and very
> little I could find elsewhere. I did see one url explaining what
> un/likely does precisely, but no guidelines. My understanding is
> that it can improve performance, as long as it's used carefully
> (otherwise it may hurt performance).

Hmm, even still I agree with Auke, you probably use it too much.

> Recently we've done a full audit of the entire code, and added
> un/likely where we felt that the chance of succeeding is 95% or
> better (e.g., error conditions that should rarely happen, and such).

Actually due to the performance penalty on some systems I think you
only want to use it if the chance of succeeding is 99% or better, as
the benefit if predicted is a cycle or two and the harm if mispredicted
can be more than 50 cycles, depending on the CPU. You should also
remember that in filesystems many "failures" are triggered by things
like the ld.so library searches, where it literally calls access() 20
different times on various possible paths for library files, failing
the first 19. It does this once for each necessary library.

Typically you only want to add unlikely() or likely() for about 2
reasons:
  (A) It's a hot path and the unlikely case is just going to burn a
      bunch of CPU anyways
  (B) It really is extremely unlikely that it fails (think physical
      hardware failure)

Anything else is just bogus.

Cheers,
Kyle Moffett
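For reference, a small userspace sketch of what the hints look like when applied to a case of the second kind (a condition that should essentially never be true). The two macros mirror the kernel's definitions on top of GCC's __builtin_expect(); sketch_sum() is a hypothetical function made up for illustration, not Unionfs code.

```c
#include <errno.h>
#include <stddef.h>

/* Same shape as the kernel's macros: normalise to 0/1, then hint. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Sum n integers into *out. A NULL pointer here would be a caller bug,
 * so the check is marked unlikely(); the loop itself carries no hints.
 * Returns 0 on success or -EINVAL on bad arguments.
 */
int sketch_sum(const int *buf, size_t n, long *out)
{
    long total = 0;
    size_t i;

    if (unlikely(buf == NULL || out == NULL))
        return -EINVAL;

    for (i = 0; i < n; i++)
        total += buf[i];
    *out = total;
    return 0;
}
```

A lookup that routinely fails, such as the repeated access() probes mentioned above, would not be a good candidate for either macro.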
Re: Chroot bug
On Sep 26, 2007, at 09:11:33, Miloslav Semler wrote:
> + long directory_is_out(struct vfsmount *wdmnt, struct dentry *wdentry,
> +                       struct vfsmount *rootmnt, struct dentry *root)
> + {
> +     struct nameidata oldentry, newentry;
> +     long ret = 1;
> +
> +     read_lock(&current->fs->lock);
> +     oldentry.dentry = dget(wdentry);
> +     oldentry.mnt = mntget(wdmnt);
> +     read_unlock(&current->fs->lock);
> +     newentry.dentry = oldentry.dentry;
> +     newentry.mnt = oldentry.mnt;
> +
> +     follow_dotdot(&newentry);
> +     /* check it */
> +     if(newentry.dentry == root &&
> +        newentry.mnt == rootmnt){
> +         ret = 0;
> +         goto out;
> +     }
> +
> +     while(oldentry.mnt != newentry.mnt ||
> +           oldentry.dentry != newentry.dentry){
> +
> +         memcpy(&oldentry, &newentry, sizeof(struct nameidata));
> +         follow_dotdot(&newentry);
> +
> +         /* check it */
> +         if(newentry.dentry == root &&
> +            newentry.mnt == rootmnt){
> +             ret = 0;
> +             goto out;
> +         }
> +     }
> + out:
> +     dput(newentry.dentry);
> +     mntput(newentry.mnt);
> +     return ret;
> + }

This is basically both painfully racy and easily broken with umount
and/or access to proc. See this busybox-compatible example:

    ## Set up chroot
    mkdir /root1
    mount -o mode=0750 -t tmpfs tmpfs /root1
    cp -a /bin/busybox /root1/busybox

    ## Enter chroot
    chroot /root1 /busybox

    ## Mount proc
    /busybox mkdir /proc
    /busybox mount -t proc proc /proc

    ## Poke around root filesystem (this may be all you need)
    /busybox ls /proc/1/root/

    ## Detach our chroot so we're no longer a sub-directory
    /busybox umount -l /proc/1/root/root1

    ## Now we can easily chroot to the original root, since it isn't
    ## in our ".." path
    exec /busybox chroot /proc/1/root /bin/sh

See how easy that is? Unless you stick the above parent-directory check
(which is still racy against directories being moved around) for
*EVERY* directory component of *EVERY* open/chdir-ish syscall, you are
still going to be easily worked around through many different methods.

Cheers,
Kyle Moffett
Re: Chroot bug
On Sep 26, 2007, at 06:27:38, David Newall wrote:
> Kyle Moffett wrote:
>> David, please do tell myself and Adrian how "locking down" chroot()
>> the way you want will avoid letting root break out through any of
>> the above ways?
>
> As has been said, there are thousands of ways to break out of a
> chroot. It's just that one of them should not be that chroot lets
> you walk out. I can't explain it clearer than that. If you don't see
> it now you probably never will.

Let me put it this way: You *CANNOT* enforce chroot() the way you want
to without a completely unacceptable performance penalty.

Let's start with the simplest example of:

    fd = open("/", O_DIRECTORY);
    chroot("/foo");
    fchdir(fd);
    chroot(".");

If you had ever actually looked at the Linux VFS, it is completely
*impossible* to tell whether "fd" at the time of the chroot is inside
or outside of "/foo" without tracking an enormous amount of extra
state. Even then, any such determination may not be valid since an FD
may be opened to an inode which is hardlinked at multiple locations in
the directory tree. It could also be bind-mounted at multiple
locations, or it may not even be mounted at all in this namespace
(CDROM that was lazy-unmounted). That FD may be later passed over an
open UNIX-domain socket from another process. Moreover, arbitrarily
closing FDs would break a huge number of programs.

Furthermore, since you can't fix the "trivial" case of 'fchdir()',
then there's no point in even *attempting* to fix the "cwd is outside
of chroot" problem, although that is basically equivalent in
difficulty to fixing the "dir-fd is outside of chroot" problem.

As for the nested-chroot() bit, the root user inside of a chroot is
always allowed to chroot(). This is necessary for test-suites for
various distro installers: chroot once to enter the installer playpen,
installer chroots again to configure the test-installed-system. Once
you allow a second chroot, you're back at the "can't reliably and
efficiently track directory sub-tree members" problem.

So if you think it can and should be fixed, then PROVIDE THE CODE.

Cheers,
Kyle Moffett
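The four-call sequence above can be fleshed out with error handling as a rough sketch. The function name sketch_chroot_escape() is hypothetical; the calls themselves are the ordinary open(2), chroot(2) and fchdir(2) interfaces. It needs root (CAP_SYS_CHROOT) to get past the first chroot(); without it the function simply reports the EPERM failure.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hold a directory fd on the current root, enter `jail`, then use the
 * fd to step back out and re-root at the old root directory.
 * Returns 0 on success, -1 with errno set on failure.
 */
int sketch_chroot_escape(const char *jail)
{
    int saved_errno;
    int fd = open("/", O_RDONLY | O_DIRECTORY);

    if (fd < 0)
        return -1;
    if (chroot(jail) < 0)   /* root changes, the open fd is untouched */
        goto fail;
    if (fchdir(fd) < 0)     /* cwd is now outside the new root */
        goto fail;
    if (chroot(".") < 0)    /* re-root at the directory the fd names */
        goto fail;
    close(fd);
    return 0;

fail:
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return -1;
}
```

Nothing in the sequence is exotic, which is the point being made: each call is legitimate on its own, and the kernel has no cheap way to know that the fd refers to a directory outside the new root.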
Re: Chroot bug
On Sep 25, 2007, at 20:55:51, Adrian Bunk wrote:
> On Wed, Sep 26, 2007 at 09:20:54AM +0930, David Newall wrote:
>> Good call. Though I suppose, since it's used 24x7 to aid security
>> on countless production servers, that security dwarfs testing.
>> Still, debugging, yes that's valid.
>
> Incompetent people implementing security solutions are a real
> problem.
>
>> I don't suppose it makes and difference; whatever the purpose, a
>> chroot that doesn't change the root is buggy.
>
> It does change the root. But it does not limit what the root user
> can do after the root was changed.

This is required for most distro installers to work:

    *Procedure to install files*
    chroot /target
    mount -t proc proc /proc
    mount -t sysfs sysfs /sys
    mount -t tmpfs tmpfs /dev
    udevd --daemon
    udevtrigger
    udevsettle
    mount /dev/cdrom0 /media/cdrom0
    *Load more kernel modules*
    *Procedure to configure newly-installed system*
    *Do other highly-privileged operations*
    *Configure networking and submit installation report*
    *Reboot*

David, please do tell myself and Adrian how "locking down" chroot() the
way you want will avoid letting root break out through any of the above
ways?

Hell, after you chroot one could probably just run:

    mount --bind /minimal_root /minimal_root
    cd /minimal_root
    mkdir old
    pivot_root . old
    cd /old
    mkdir old_minimal_root
    pivot_root . old_minimal_root
    umount /old_minimal_root
    rmdir /old_minimal_root

Now, like magic, the entire system is once more accessible.
Alternatively you could:

    mount -t proc proc /proc
    cat /proc/1/mounts
    mount -t $ROOTFS_FROM_PROC $ROOTDEV_FROM_PROC /

Either way root can trivially break out of any chroot using FUNDAMENTAL
PRIMITIVES that he/she always has access to. If you want to take those
away you have to use SELinux or capabilities, in which case you could
just take away the CAP_SYS_CHROOT capability in the first place!

Cheers,
Kyle Moffett
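As one hedged illustration of the "take away CAP_SYS_CHROOT" option: on current Linux kernels a process can remove the capability from its bounding set with prctl(2). Note that PR_CAPBSET_READ and PR_CAPBSET_DROP were added in Linux 2.6.25, after this thread, so this is not what was available at the time. The function names are hypothetical. Dropping from the bounding set limits what later execve() calls can gain; it does not by itself clear the capability from the caller's current effective set.

```c
#include <errno.h>
#include <linux/capability.h>
#include <sys/prctl.h>

/* Returns 1 if CAP_SYS_CHROOT is in the bounding set, 0 if not, -1 on error. */
int sketch_chroot_cap_present(void)
{
    return prctl(PR_CAPBSET_READ, CAP_SYS_CHROOT, 0, 0, 0);
}

/*
 * Remove CAP_SYS_CHROOT from the bounding set.
 * Returns 0 on success, -1 with errno set (EPERM without CAP_SETPCAP).
 */
int sketch_drop_chroot_cap(void)
{
    return prctl(PR_CAPBSET_DROP, CAP_SYS_CHROOT, 0, 0, 0);
}
```

The same reasoning applies to the mount-based escapes listed above: they would need CAP_SYS_ADMIN removed as well, which is why the reply treats chroot() on its own as the wrong tool for confining root.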
Re: [PATCH 1/3] Fix coding style
On Sep 25, 2007, at 15:16:20, Ingo Oeser wrote:
> On Tuesday 25 September 2007, Srivatsa Vaddagiri wrote:
>> @@ -297,7 +293,7 @@ static int __init init_sched_debug_procf
>>      pe->proc_fops = &sched_debug_fops;
>>
>>  #ifdef CONFIG_FAIR_USER_SCHED
>> -    pe = create_proc_entry("root_user_share", 0644, NULL);
>> +    pe = create_proc_entry("root_user_cpu_share", 0644, NULL);
>>      if (!pe)
>>          return -ENOMEM;
>
> What about moving this debug stuff under debugfs? Please consider
> using the functions in <linux/debugfs.h>. They compile into nothing,
> if DEBUGFS is not compiled in and have already useful functions for
> reading/writing integers and booleans.

Umm, that's not a debugging thing. It appears to be a tunable allowing
you to configure what percentage of the total CPU UID 0 gets, which is
likely to be useful to configure on production systems, at least until
better group-scheduling tools are produced.

Cheers,
Kyle Moffett
Re: [PATCH 1/2] bnx2: factor out gzip unpacker
On Sep 24, 2007, at 13:32:23, Lennart Sorensen wrote:
> On Fri, Sep 21, 2007 at 11:37:52PM +0100, Denys Vlasenko wrote:
>> But I compile net/* into bzImage. I like netbooting :)
>
> Isn't it possible to netboot with an initramfs image? I am pretty
> sure I have seen some systems do exactly that.

Yeah, I've got Debian boxes that have never *not* netbooted (one Dell
Op^?^?Craptiplex box whose BIOS and ACPI sucks so bad it can't even
load GRUB/LILO, although Windows somehow works fine). So they boot
PXELinux using the PXE boot ROM on the NICs and it loads both a kernel
and an initramfs into memory. Kernel is stock Debian and hardly has
enough built-in to spit at you, let alone find network/disks, but it
manages to load everything it needs off the automagically-generated
initramfs.

Cheers,
Kyle Moffett
Re: [PATCH] Uninline kcalloc()
On Sep 24, 2007, at 01:35:08, [EMAIL PROTECTED] wrote:
> On Sun, 23 Sep 2007 00:03:49 +0400, Alexey Dobriyan said:
>> -static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
>> -{
>> -    if (n != 0 && size > ULONG_MAX / n)
>> -        return NULL;
>> -    return __kmalloc(n * size, flags | __GFP_ZERO);
>> -}
>> +void *kcalloc(size_t n, size_t size, gfp_t flags);
>
> NAK. This busticates some pretty subtle code in mm/slab.c that uses
> uses __builtin_return_address() for debugging - if you do this, then
> the "calling function" gets listed as "kcalloc()" rather than the
> much more useful "function that called kcalloc()" (which is what you
> care about).
>
> (I remember going around and around multiple times getting those
> stupid inlines set up right, so that feature actually did something
> useful, otherwise kcalloc and kzalloc didn't report where they were
> called from).

Proper fix is to give __kmalloc a "void *caller" parameter and have all
of the various wrapper functions pass in the value of
__builtin_return_address() appropriately. I believe that even works
properly for inline functions which may or may not be inlined.

Cheers,
Kyle Moffett
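A userspace sketch of the "explicit caller parameter" idea, under the assumption that the wrapper stays out of line. All names here (sketch_kmalloc_track(), sketch_kcalloc(), sketch_last_caller) are hypothetical stand-ins, and calloc() replaces the real allocator. The wrapper keeps the overflow check from the quoted patch and forwards its own return address, so the allocator records the code that called the wrapper.

```c
#include <stdint.h>
#include <stdlib.h>

/* Last recorded call site; stands in for the slab debugging records. */
static const void *sketch_last_caller;

/* Hypothetical allocator entry point that is told who the caller is. */
static void *sketch_kmalloc_track(size_t bytes, const void *caller)
{
    sketch_last_caller = caller;
    return calloc(1, bytes); /* zeroed memory, like __GFP_ZERO */
}

/*
 * Out-of-line wrapper. __builtin_return_address(0) evaluated here is
 * an address inside whoever called sketch_kcalloc(), which is the
 * call site worth reporting.
 */
__attribute__((noinline))
void *sketch_kcalloc(size_t n, size_t size)
{
    if (n != 0 && size > SIZE_MAX / n)
        return NULL; /* n * size would overflow */
    return sketch_kmalloc_track(n * size, __builtin_return_address(0));
}
```

How this behaves when a wrapper is inlined depends on the compiler, so the noinline attribute is used here to keep the sketch unambiguous.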
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Sep 23, 2007, at 02:22:12, Goswin von Brederlow wrote:
> [EMAIL PROTECTED] (Mel Gorman) writes:
>> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>>> But when you already have say 10% of the ram in mixed groups then
>>> it is a sign the external fragmentation happens and some time
>>> should be spend on moving movable objects.
>>
>> I'll play around with it on the side and see what sort of results I
>> get. I won't be pushing anything any time soon in relation to this
>> though. For now, I don't intend to fiddle more with grouping pages
>> by mobility for something that may or may not be of benefit to a
>> feature that hasn't been widely tested with what exists today.
>
> I watched the videos you posted. A nice and quite clear improvement
> with and without your logic. Cudos.
>
> When you play around with it may I suggest a change to the display
> of the memory information. I think it would be valuable to use a
> Hilbert Curve to arange the pages into pixels. Like this:
>
>     # #      0 3
>     # #
>     ###      1 2
>
>     ### ###  0 1 E F
>       # #
>     ### ###  3 2 D C
>     #     #
>     # ### #  4 7 8 B
>     # # # #
>     ### ###  5 6 9 A

Here's an excellent example of an 0-255 numbered hilbert curve used to
enumerate the various top-level allocations of IPv4 space:
http://xkcd.com/195/

Cheers,
Kyle Moffett
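For anyone wanting to try the suggested display, here is a short sketch of the usual index-to-coordinate conversion for a Hilbert curve on an n-by-n grid (n a power of two). The function names are hypothetical; the algorithm is the widely published iterative one, not anything taken from the patch set. With x as the column and y as the row counted from the top, it reproduces the numbering in the quoted diagrams.

```c
/* Rotate/flip a quadrant of side n so sub-curves join up correctly. */
static void sketch_hilbert_rot(int n, int *x, int *y, int rx, int ry)
{
    if (ry == 0) {
        int t;

        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        t = *x; /* swap x and y */
        *x = *y;
        *y = t;
    }
}

/*
 * Map distance d along the curve (0 .. n*n-1) to grid position (x, y).
 * n must be a power of two; a page frame number could serve as d.
 */
void sketch_hilbert_d2xy(int n, int d, int *x, int *y)
{
    int rx, ry, s, t = d;

    *x = 0;
    *y = 0;
    for (s = 1; s < n; s *= 2) {
        rx = 1 & (t / 2);
        ry = 1 & (t ^ rx);
        sketch_hilbert_rot(s, x, y, rx, ry);
        *x += s * rx;
        *y += s * ry;
        t /= 4;
    }
}
```

Consecutive indices always land on adjacent pixels, so a contiguous run of pages shows up as a compact blob instead of a thin stripe.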