[RFC] Best method to control a transmit-only mode on fiber NICs (specifically sky2)

2008-02-15 Thread Kyle Moffett
Hi,

The company I'm working for has an unusual fiber NIC configuration
that we use for one of our network appliances.  We connect only a
single fiber from the TX port on one NIC to the RX port on another
NIC, providing a physically-one-way path for enhanced security.
Unfortunately this doesn't work with most NIC drivers, as even with
auto-negotiation off they look for link probe pulses before they
consider the link up and are willing to send packets.  We have been
able to use Myricom 10GigE NICs with a custom firmware image.  More
recently we have patched the sky2 driver to turn on the FIB_FORCE_LNK
flag in the PHY control register; this seems to work on the
Marvell-chipset boards we have here.

What would be the preferred way to control this force link flag?
Right now we are accessing it using ethtool; we have added an
additional duplex mode: DUPLEX_TXONLY, with a value of 2.  When
you specify a speed and turn off autonegotiation (./patched-ethtool
-s eth2 speed 1000 autoneg off duplex txonly), it will turn on the
specified bit in the PHY control register and the link will
automatically come up.  We also have one related bug-fix^Wdirty hack
for sky2 to reset the PHY a second time during netif-up after enabling
interrupts; otherwise the immediate link up interrupt gets lost.
Once I get approval from the company I will post the patch itself for
review.
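
For illustration, here is roughly what the driver side of that knob
looks like.  This is a sketch, not the actual patch: DUPLEX_TXONLY,
PHY_REG_CTRL, and PHY_FORCE_LNK are stand-in names (gm_phy_read() and
gm_phy_write() are sky2's real PHY accessors):

	#define DUPLEX_TXONLY	2	/* proposed ethtool duplex value */

	/* Sketch: called from sky2's ethtool set_settings path when
	 * autonegotiation is off and duplex == DUPLEX_TXONLY.  The
	 * register/bit names here are placeholders. */
	static void sky2_force_txonly(struct sky2_port *sky2)
	{
		u16 ctrl = gm_phy_read(sky2->hw, sky2->port, PHY_REG_CTRL);

		/* Force "link good" so the MAC transmits even though it
		 * will never see RX probe pulses on a TX-only fiber. */
		gm_phy_write(sky2->hw, sky2->port, PHY_REG_CTRL,
			     ctrl | PHY_FORCE_LNK);
	}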

I look forward to your comments and suggestions.

Cheers,
Kyle Moffett


[NET/IPv6] Race condition with flow_cache_genid?

2008-02-06 Thread Kyle Moffett
Whoops, I accidentally sent this to [EMAIL PROTECTED] instead of
[EMAIL PROTECTED]  Original email below:


Hi, I was poking around trying to figure out how to install the Mobile
IPv6 daemons this evening and noticed they required a kernel patch,
although upon further inspection the kernel patch seemed to already be
applied in 2.6.24.  Unfortunately the flow cache appears to be
horribly racy.  Attached below are the only uses of the variable
flow_cache_genid in 2.6.24.

Now, I am no expert in this particular area of the code, but the
atomic_t flow_cache_genid variable is ONLY ever used with
atomic_inc() and atomic_read().  There are no memory barriers or other
dec_and_test()-style functions, so that variable could just as easily
be replaced with a plain old C int.

Basically either there is some missing locking here or it does not
need to be atomic_t.  Judging from the way it *appears* to be used
to check if cache entries are up-to-date with the latest changes in
policy, I would guess the former.

In particular that whole flow_cache_lookup() thing looks racy as
hell, since somebody could be in the middle of that looking at if
(fle->genid == atomic_read(&flow_cache_genid)).  It does the
atomic_read(), which BTW is literally implemented as:
  #define atomic_read(atomicvar) ((atomicvar)->value)
on some platforms.  Immediately after the atomic read (or even before,
since there's no cache-flush or read-modify-write), somebody calls
into selinux_xfrm_notify_policyload() and increments the
flow_cache_genid because selinux just loaded a security policy.  Now
we're accepting a cache entry which applies to the PREVIOUS security
policy.  I can only assume that's really bad.

Even worse, there seems to be a race between SELinux loading a new
policy and calling selinux_xfrm_notify_policyload(), since we could
easily get packets and process them according to the old cache entry
on one CPU before SELinux has had a chance to update the generation ID
from the other.  Furthermore, there's no guarantee the CPU caches will
get updated in reasonable time.  Clearly SELinux needs to have some
way of atomically invalidating the flow cache of all CPUs
*simultaneously* with loading a new policy, which probably means they
both need to be under the same lock, or something.

The same problem appears to occur with updating the XFRM policy and
incrementing flow_cache_genid.  Probably the fastest solution is to
put the flow cache under the xfrm_policy_lock (which already disables
local bottom-halves), and either take that lock during SELinux policy
load or if there are lock ordering problems then add a variable
flow_cache_ignore and change the xfrm_notify hooks:

void selinux_xfrm_notify_policyload_pre(void)
{
	write_lock_bh(&xfrm_policy_lock);
	flow_cache_genid++;
	flow_cache_ignore = 1;
	write_unlock_bh(&xfrm_policy_lock);
}

void selinux_xfrm_notify_policyload_post(void)
{
	write_lock_bh(&xfrm_policy_lock);
	flow_cache_ignore = 0;
	write_unlock_bh(&xfrm_policy_lock);
}
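
For completeness, the lookup side would have to honor the new flag
under the same lock.  A minimal sketch (flow_cache_ignore and this
helper are part of the proposal, not mainline, and it assumes
xfrm_policy_lock were shared with net/core/flow.c):

	static int flow_entry_valid(const struct flow_cache_entry *fle)
	{
		int valid;

		read_lock_bh(&xfrm_policy_lock);
		valid = !flow_cache_ignore &&
			fle->genid == atomic_read(&flow_cache_genid);
		read_unlock_bh(&xfrm_policy_lock);

		return valid;
	}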

Cheers,
Kyle Moffett


BEGIN QUOTED CODE INVOLVING flow_cache_genid:

include/net/flow.h:94:
extern atomic_t flow_cache_genid;

net/core/flow.c:39:
atomic_t flow_cache_genid = ATOMIC_INIT(0);

net/core/flow.c:169:flow_cache_lookup():
	if (flow_hash_rnd_recalc(cpu))
		flow_new_hash_rnd(cpu);
	hash = flow_hash_code(key, cpu);

	head = &flow_table(cpu)[hash];
	for (fle = *head; fle; fle = fle->next) {
		if (fle->family == family &&
		    fle->dir == dir &&
		    flow_key_compare(key, &fle->key) == 0) {
			if (fle->genid == atomic_read(&flow_cache_genid)) {
				void *ret = fle->object;

				if (ret)
					atomic_inc(fle->object_ref);
				local_bh_enable();

				return ret;
			}
			break;
		}
	}

net/xfrm/xfrm_policy.c:1025:
int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
{
	write_lock_bh(&xfrm_policy_lock);
	pol = __xfrm_policy_unlink(pol, dir);
	write_unlock_bh(&xfrm_policy_lock);
	if (pol) {
		if (dir < XFRM_POLICY_MAX)
			atomic_inc(&flow_cache_genid);
		xfrm_policy_kill(pol);
		return 0;
	}
	return -ENOENT;
}

net/ipv6/inet6_connection_sock.c:142:
static inline
void __inet6_csk_dst_store(struct sock *sk, struct dst_entry *dst,
			   struct in6_addr *daddr, struct in6_addr *saddr)
{
	__ip6_dst_store(sk, dst, daddr, saddr);

#ifdef CONFIG_XFRM
	{
		struct rt6_info *rt = (struct rt6_info *)dst;
		rt->rt6i_flow_cache_genid = atomic_read(&flow_cache_genid);
	}
#endif
}

security/selinux/include/xfrm.h:41:
static inline void selinux_xfrm_notify_policyload

Re: [PATCH 1/2] bnx2: factor out gzip unpacker

2007-09-24 Thread Kyle Moffett

On Sep 24, 2007, at 13:32:23, Lennart Sorensen wrote:

On Fri, Sep 21, 2007 at 11:37:52PM +0100, Denys Vlasenko wrote:

But I compile net/* into bzImage. I like netbooting :)


Isn't it possible to netboot with an initramfs image?  I am pretty  
sure I have seen some systems do exactly that.


Yeah, I've got Debian boxes that have never *not* netbooted (one Dell  
Op^?^?Craptiplex box whose BIOS and ACPI sucks so bad it can't even  
load GRUB/LILO, although Windows somehow works fine).  So they boot  
PXELinux using the PXE boot ROM on the NICs and it loads both a  
kernel and an initramfs into memory.  Kernel is stock Debian and
hardly has enough built-in to spit at you, let alone find
network/disks, but it manages to load everything it needs off the
automagically-generated initramfs.


Cheers,
Kyle Moffett



Re: Distributed storage. Move away from char device ioctls.

2007-09-16 Thread Kyle Moffett
-accessed inode  
objects and creates non-fragmented copies before deleting the old ones.


There's a lot of other technical details which would need resolution  
in an actual implementation, but this is enough of a summary to give  
you the gist of the concept.  Most likely there will be some major  
flaw which makes it impossible to produce reliably, but the concept  
contains the things I would be interested in for a real networked  
filesystem.


Cheers,
Kyle Moffett


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote:

On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:

On Sun, 9 Sep 2007 19:02:54 +0100
Denys Vlasenko [EMAIL PROTECTED] wrote:

Why is all this fixation on volatile? I don't think people want
the volatile keyword per se, they want atomic_read(x) to _always_
compile into a memory-accessing instruction, not a register access.


and ... why is that?  is there any valid, non-buggy code sequence  
that makes that a reasonable requirement?


Well, if you insist on having it again:

Waiting for atomic value to be zero:

	while (atomic_read(&x))
		continue;

gcc may happily convert it into:

	reg = atomic_read(&x);
	while (reg)
		continue;


Bzzt.  Even if you fixed gcc to actually convert it to a busy loop on
a memory variable, you STILL HAVE A BUG, as it may *NOT* be gcc that
does the conversion: it may be the CPU doing the caching of the
memory value.  GCC has no mechanism to do cache-flushes or
memory-barriers except through our custom inline assembly.  Also, you
probably want a cpu_relax() in there somewhere to avoid overheating
the CPU.  Thirdly, on a large system it may take some arbitrarily
large amount of time for cache-propagation to update the value of the
variable in your local CPU cache.  Finally, if atomics are based on
spinlock+interrupt-disable then you will sit in a tight busy-loop of
spin_lock_irqsave()->spin_unlock_irqrestore().  Depending on your
system's internal model this may practically lock up your core,
because the spin_lock() will take the cacheline for exclusive access
and doing that in a loop can prevent any other CPU from doing any
operation on it!  And since your IRQs are disabled, there is only a
very small window in which an IRQ can come along and free it up long
enough for the update to take place.
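
(For the record, if you absolutely must busy-wait, the least-wrong
way to spell it is with cpu_relax(), which is also a compiler barrier
and lets the core breathe:

	while (atomic_read(&x))
		cpu_relax();

It is still a bad idea, for all the reasons above.)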


The earlier code segment of:

	while (atomic_read(&x) > 0)
		atomic_dec(&x);

is *completely* buggy because you could very easily have 4 CPUs doing
this on an atomic variable with a value of 1 and end up with it at
negative 3 by the time you are done.  Moreover all the alternatives
are also buggy, with the sole exception of this rather
obvious-seeming one:

	atomic_set(&x, 0);


You simply CANNOT use an atomic_t as your sole synchronizing  
primitive, it doesn't work!  You virtually ALWAYS want to use an  
atomic_t in the following types of situations:


(A) As an object refcount.  The value is never read except as part of
an atomic_dec_return().  Why aren't you using struct kref?  (See the
sketches after this list.)


(B) As an atomic value counter (number of processes, for example).
Just reading the value is racy anyways; if you want to enforce a
limit or something then use atomic_inc_return(), check the result,
and use atomic_dec() if it's too big (again, see the sketches after
this list).  If you just want to return the statistics then you are
going to be instantaneous-point-in-time anyways.


(C) As an optimization value (statistics-like, but exact accuracy  
isn't important).
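

To make (A) and (B) concrete, two minimal sketches; struct widget,
widget_release(), MAX_WIDGETS, and widget_account() are made up for
illustration, while kref_get()/kref_put() and atomic_inc_return() are
the real APIs:

	/* (A) refcounting via struct kref instead of a bare atomic_t */
	struct widget {
		struct kref ref;
	};

	static void widget_release(struct kref *ref)
	{
		kfree(container_of(ref, struct widget, ref));
	}
	/* kref_get(&w->ref) takes a reference; the final
	 * kref_put(&w->ref, widget_release) frees the object. */

	/* (B) enforcing a limit with atomic_inc_return() */
	static atomic_t nr_widgets = ATOMIC_INIT(0);
	#define MAX_WIDGETS 128		/* made-up limit */

	static int widget_account(void)
	{
		if (atomic_inc_return(&nr_widgets) > MAX_WIDGETS) {
			atomic_dec(&nr_widgets);
			return -EBUSY;
		}
		return 0;
	}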


Atomics are NOT A REPLACEMENT for proper kernel subsystems like
completions, mutexes, semaphores, spinlocks, krefs, etc.  They are
not useful for synchronization, only for keeping track of simple
integer RMW values.  Note that atomic_read() and atomic_set() aren't
very useful RMW primitives (read-nomodify-nowrite and
read-set-zero-write).  Code which assumes anything else is probably
buggy in other ways too.


So while I see no real reason for the volatile on the atomics, I  
also see no real reason why it's terribly harmful.  Regardless of the  
volatile on the operation the CPU is perfectly happy to cache it  
anyways so it doesn't buy you any actual always-access-memory  
guarantees.  If you are just interested in it as an optimization you  
could probably just read the properly-aligned integer counter  
directly, an atomic read on most CPUs.


If you really need it to hit main memory *every* *single* *time*  
(Why?  Are you using it instead of the proper kernel subsystem?)   
then you probably need a custom inline assembly helper anyways.


Cheers,
Kyle Moffett



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 12:46:33, Denys Vlasenko wrote:
My point is that people are confused as to what atomic_read()
exactly means, and this is bad.  Same for cpu_relax().  The first
one says "read", and the second one doesn't say "barrier".


Q&A:

Q:  When is it OK to use atomic_read()?
A:  You are asking the question, so never.

Q:  But I need to check the value of the atomic at this point in time...
A:  Your code is buggy if it needs to do that on an atomic_t for  
anything other than debugging or optimization.  Use either  
atomic_*_return() or a lock and some normal integers.


Q:  So why can't the atomic_read DTRT magically?
A:  Because the right thing depends on the situation and is usually  
best done with something other than atomic_t.


If somebody can post some non-buggy code which is correctly using  
atomic_read() *and* depends on the compiler generating extra  
nonsensical loads due to volatile then the issue *might* be  
reconsidered.  This also includes samples of code which uses  
atomic_read() and needs memory barriers (so that we can fix the buggy  
code, not so we can change atomic_read()).  So far the only code  
samples anybody has posted are buggy regardless of whether or not the  
value and/or accessors are flagged volatile or not.  And hey, maybe  
the volatile ops *should* be implemented in inline ASM for
future-proof-ness, but that's a separate issue.


Cheers,
Kyle Moffett



Re: [DRIVER SUBMISSION] DRBD wants to go mainline

2007-07-25 Thread Kyle Moffett

On Jul 25, 2007, at 22:03:37, [EMAIL PROTECTED] wrote:

On Wed, 25 Jul 2007, Satyam Sharma wrote:

On 7/25/07, Lars Ellenberg [EMAIL PROTECTED] wrote:

On Wed, Jul 25, 2007 at 04:41:53AM +0530, Satyam Sharma wrote:

[...]
But where does the send come into the picture over here -- a  
send won't block forever, so I don't foresee any issues  
whatsoever w.r.t.  kthreads conversion for that. [ BTW I hope  
you're *not* using any signals-based interface for your kernel  
thread _at all_. Kthreads disallow (ignore) all signals by  
default, as they should, and you really shouldn't need to write  
any logic to handle or   do-certain-things-on-seeing a signal  
in a well designed kernel thread. ] and the sending latency is  
crucial to performance, while the recv will not timeout for the  
next few seconds.  Again, I don't see what sending latency has  
to do with a kernel_thread to kthread conversion. Or with  
signals, for that matter. Anyway, as Kyle Moffett mentioned  
elsewhere, you could probably look at other examples (say  
cifs_demultiplexer_thread() in fs/cifs/connect.c).


the basic problem, and what we use signals for, is:  it is  
waiting in recv, waiting for the peer to say something.  but I  
want it to stop recv, and go send something right now.


That's ... weird. Most (all?) communication between any two  
parties would follow a protocol where someone recv's stuff, does  
something with it, and sends it back ... what would you send  
right now if you didn't receive anything?


because even though you didn't receive anything you now have
something important to send.


remember that both sides can be sitting in receive mode. this puts  
them both in a position to respond to the other if the other has  
something to say.


Why not just have two threads, one for sending and one for
receiving?  When your receiving thread gets data it takes
appropriate locks and processes it, then releases the locks and goes
back to waiting for packets.  Your sending thread would take
appropriate locks, generate data to send, release locks, and transmit
packets.  You don't have to interrupt the receive thread to send
packets, so where's the latency problem, exactly?


If I were writing that in userspace I would have:

(A) The pool of IO-generating threads (IE: What would ordinarily be  
userspace)

(B) One or a small number of data-reception threads.
(C) One or a small number of data-transmission threads.

When you get packets to process in your network-reception thread(s),
you queue appropriate disk IOs and any appropriate responses with
your transmission thread(s).  You can basically just sit in a loop on
tcp_recvmsg => demultiplex => do-stuff.  When your IO-generators
actually make stuff to send you queue such data for disk IO, then
packetize it and hand it off to your data-transmission threads.


If you made all your sockets and inter-thread pipes nonblocking then  
in userspace you would just epoll_wait() on the sockets and pipes and  
be easily able to react to any IO from anywhere.
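
A userspace skeleton of that epoll loop, for concreteness (sock_fd
and pipe_fd are assumed to already exist and be nonblocking):

	#include <sys/epoll.h>

	static void io_loop(int sock_fd, int pipe_fd)
	{
		struct epoll_event ev, out[8];
		int i, n, ep = epoll_create(8);

		ev.events = EPOLLIN;
		ev.data.fd = sock_fd;
		epoll_ctl(ep, EPOLL_CTL_ADD, sock_fd, &ev);
		ev.data.fd = pipe_fd;
		epoll_ctl(ep, EPOLL_CTL_ADD, pipe_fd, &ev);

		for (;;) {
			n = epoll_wait(ep, out, 8, -1);
			for (i = 0; i < n; i++) {
				if (out[i].data.fd == sock_fd)
					/* read packets, queue disk IO + replies */;
				else
					/* drain pipe, packetize and transmit */;
			}
		}
	}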


In kernel space there are similar nonblocking interfaces, although it  
would probably be easier just to use a couple threads.


Cheers,
Kyle Moffett



Re: [DRIVER SUBMISSION] DRBD wants to go mainline

2007-07-23 Thread Kyle Moffett
For the guys on netdev, would you please look at the
tcp_recvmsg-threading and TCP_NAGLE_CORK issues below and give
opinions on the best way to proceed?


One thing to remember, you don't necessarily have to merge every  
feature right away.  As long as the new code is configured off by  
default with an (EXPERIMENTAL) warning, you can start getting the  
core parts and the cleanups upstream before you have to resolve all  
the issues with low-latency, dynamic-tracing-frameworks, etc.


On Jul 23, 2007, at 09:32:02, Lars Ellenberg wrote:

On Sun, Jul 22, 2007 at 09:32:02PM -0400, Kyle Moffett wrote:

+/* I don't remember why XCPU ...
+ * This is used to wake the asender,
+ * and to interrupt sending the sending task
+ * on disconnect.
+ */
+#define DRBD_SIG SIGXCPU


Don't use signals between kernel threads, use proper primitives  
like notifiers and waitqueues, which means you should also  
probably switch away from kernel_thread() to the kthread_*()  
APIs.  Also you should fix this FIXME or remove it if it no longer  
applies:-D.


right.
but how do I tell a network thread in tcp_recvmsg to stop early,
without using signals?


I'm not really a kernel-networking guy, so I can't answer this  
definitively, but I'm pretty sure the problem has been solved in many  
network filesystems and such, so I've added a netdev CC.  The way I'd
do it in userspace is with nonblocking IO and epoll(); that way I
don't actually have to "stop" or "signal" the thread, I can just add
a socket to my epoll fd when I want to pay attention to it, and remove
it from my epoll fd when I'm done with it.  I'd assume there's some
equivalent way in kernelspace based around the struct kiocb *iocb  
and int nonblock parameters to the tcp_recvmsg() kernel function.
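
In-kernel, the nonblocking variant would presumably look something
like this sketch (kernel_recvmsg() is the real API; the helper is
hypothetical):

	/* Poll a kernel socket without sleeping in tcp_recvmsg();
	 * returns -EAGAIN instead of blocking when no data is queued. */
	static int try_recv(struct socket *sock, void *buf, size_t len)
	{
		struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
		struct kvec iov = { .iov_base = buf, .iov_len = len };

		return kernel_recvmsg(sock, &msg, &iov, 1, len, MSG_DONTWAIT);
	}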



+/* see kernel/printk.c:printk_ratelimit
+ * macro, so it is easy to have independent rate limits at
+ * different locations
+ * "initializer element not constant ..." with kernel 2.4 :(
+ * so I initialize toks to something large
+ */
+#define DRBD_ratelimit(ratelimit_jiffies, ratelimit_burst) \

Any particular reason you can't just use printk_ratelimit for this?


I want to be able to do a rate-limit per specific message/code  
fragment, without affecting other messages or execution paths.


Ok, so could you change your patch to modify __printk_ratelimit() to
also accept a struct printk_rate datastructure and make
printk_ratelimit() call __printk_ratelimit(&global_printk_rate)?
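
Something along these lines; struct printk_rate and this signature
are the suggestion, not an existing kernel interface:

	struct printk_rate {
		spinlock_t	lock;
		unsigned long	toks;		/* token bucket */
		unsigned long	last_msg;	/* jiffies at last message */
		int		missed;		/* suppressed messages */
	};

	extern struct printk_rate global_printk_rate;

	int __printk_ratelimit(struct printk_rate *rate,
			       int ratelimit_jiffies, int ratelimit_burst);

	#define printk_ratelimit() \
		__printk_ratelimit(&global_printk_rate, 5 * HZ, 10)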


Typically if $KERNEL_FEATURE is insufficient for your needs you  
should fix $KERNEL_FEATURE instead of duplicating a replacement in  
your driver.  This applies to basically all of the things I'm talking  
about: kernel-threads, workqueues (BTW: I believe you can make your
own custom workqueue thread(s) instead of using the default events/*
ones), debugging macros, fault-insertion, integer math,
lock-checking, dynamic tracing, etc.  If you find some reason that
some generic code won't work for you, please try to fix it first so
we can all benefit from it.


Umm, how about fixing this to actually use proper workqueues or  
something instead of this open-coded mess?


unlikely to happen right now.  but it is on our todo list...


Unfortunately problems like these need to be fixed before a mainline  
merge.  Merging duplicated code is a big no-no, and historically  
there have been problems with people who merge code and never  
properly maintain it once it's in tree.  As a result the rule is your  
code has to be easily maintainable before anybody will even  
*consider* merging it.



+/* I want the packet to fit within one page
+ * THINK maybe use a special bitmap header,
+ * including offset and compression scheme and whatnot
+ * Do not use PAGE_SIZE here! Use an architecture-agnostic constant!
+ */
+#define BM_PACKET_WORDS ((4096-sizeof(struct Drbd_Header))/sizeof(long))


Yuck.  Definitely use PAGE_SIZE here, so at least if it's broken  
on an arch with multiple page sizes, somebody can grep for  
PAGE_SIZE to fix it.  It also means that on archs/configs with 8k  
or 64k pages you won't waste a bunch of memory.


No. This is not to allocate anything, but defines the chunk size  
with which we transmit the bitmap, when we have to.  We need to be  
able to talk from one arch (say i586) to some other (say s390, or  
sparc, or whatever).  The receiving side has a one-page buffer,
on which it may or may not do endian-conversion.  The hardcoded
4096 is the minimal common denominator here.


Ahhh.  Please replace the constant 4096 with:

	/* This is the maximum amount of bitmap we will send per packet */
	#define MAX_BITMAP_CHUNK_SIZE 4096
	#define BM_PACKET_WORDS \
		((MAX_BITMAP_CHUNK_SIZE - sizeof(struct Drbd_Header))/sizeof(long))

It's more text but dramatically improves the readability by  
eliminating more magic numbers.  This is a much milder case than I've  
seen in the past, so it's not that big of a deal.




+/* Dynamic tracing framework */
guess we

Re: PM policy, hotplug, power saving (was Re: [PATCH] b44: power down PHY when interface down)

2007-06-30 Thread Kyle Moffett

On Jun 30, 2007, at 12:42:06, Jeff Garzik wrote:
Definitely matters.  Switch renegotiation can take a while, and you  
must take into account the common case of interface bouncing  
(immediate down, then up).


Hordes actively complained the few times we experimented with this,
because of e.g. DHCP's habit of bouncing the interface, which  
resulted in PHY power bouncing, which resulted in negotiation,  
which resulted in an excrutiating wait on various broken or stupid  
switches.


Overall, this may be classed with other problems of a similar  
sort:  we can power down a PHY, but that removes hotplug capability  
and extends partner/link negotiation time.


Like SATA, we actually want to support BOTH -- active hotplug and  
PHY power-down -- and so this wanders into power management policy.


Give me a knob, and we can program plenty of ethernet|SATA|USB|...  
drivers to power down the PHY and save power.


With some buggy switches and other hardware you actually *want* to  
bounce the link to get them to properly renegotiate.  I can also see  
wanting to power off and on a single-PoE-port NIC to restart whatever  
device is at the other end, although I don't know if any such devices  
exist.  Currently the tg3 driver turns the PHY off and on during
down/up on a few of my systems, which I use to make a buggy no-name
switch recognize STP changes properly.


Cheers,
Kyle Moffett


Re: Scaling Max IP address limitation

2007-06-24 Thread Kyle Moffett

On Jun 24, 2007, at 13:20:01, David Jones wrote:
Hi, I am trying to add multiple IP addresses (v6) to my FC7 box
on eth0, but I am hitting a max limit of 4000 IP addresses.  Seems
like there is a limiting variable in the Linux kernel (which one?)
that prevents adding more than 4096 IP addresses.  What do I
need to change in the Linux kernel (and then recompile) to be able
to add more than 4K addresses per system? ..


Do you really need that many IP addresses?  When somebody finally  
gets around to implementing REDIRECT support for ip6tables then you  
could just redirect them all to the same port on the local system.   
Then with a happy little getsockopt() you can find out the original  
IP address for use in whatever application you are running.  That's  
likely to be a thousand times more efficient than binary searching  
through 5000+ mostly-sequential IP addresses per received packet.
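
For reference, the IPv4 incarnation of that happy little
getsockopt() looks like this (the whole point above is that ip6tables
has no equivalent yet; conn_fd is an accepted, REDIRECTed connection):

	#include <stdio.h>
	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <sys/socket.h>
	#include <linux/netfilter_ipv4.h>	/* SO_ORIGINAL_DST */

	static void print_original_dst(int conn_fd)
	{
		struct sockaddr_in orig;
		socklen_t olen = sizeof(orig);

		if (getsockopt(conn_fd, SOL_IP, SO_ORIGINAL_DST,
			       &orig, &olen) == 0)
			printf("client really wanted %s:%u\n",
			       inet_ntoa(orig.sin_addr),
			       ntohs(orig.sin_port));
	}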


<unrelated wishful thinking>
I keep having hopeful dreams that one day netfilter will grow support  
for cross-protocol NAT (IE: NAT a TCPv4 connection over TCPv6 to the  
IPv6-only local web server, or vice versa).  It would seem that would  
require a merged xtables program.


Having routing table operations, IPsec transformations, etc just be  
another step in the firewall rules would also be useful.  It would be  
handy to be able to -j ROUTE, then -j IPSEC, then -j ROUTE  
again, to re-route the now-encapsulated IPsec packets to their proper  
destination.  That would also eliminate the sort-of-hacky problems  
with destination network interface in the bridging code: -j BRIDGE  
might be another step in the process, and conceivably you could have  
independent bridge MAC tables too.  You'd probably also want -j  
BRIDGE_TEST and -j ROUTE_TEST to compute the output network  
interface without actually modifying the addresses.


That would also appear to get rid of the need for all tables other  
than filter and all predefined chains other than INPUT and  
OUTPUT.  Default rules would be these:

nettables -A INPUT -j CONNTRACK
nettables -A INPUT -j LOCALMATCH
nettables -A INPUT --for-this-host -j ACCEPT
nettables -A INPUT -j OUTPUT
nettables -A OUTPUT -j ROUTE
nettables -A OUTPUT -j TRANSMIT

Forwarded packets would be sent right into the OUTPUT chain from the  
INPUT chain by appropriate rules.  Instead of turning off  
ip_forwarding in /proc/sys, you could just change the -j OUTPUT in  
the INPUT chain to -j ACCEPT, and it would be impossible to forward  
packets.  I can't see any functionality that we have today which a  
mechanism like this wouldn't support, aside from the fact that it  
hands the admin a loaded nuclear missile aimed at their foot.
(Flushing the INPUT chain would basically be analogous to committing
network suicide, although there exist other ways to do that with
netfilter today.)

</unrelated wishful thinking>

Cheers,
Kyle Moffett



Re: [BUG][debian-2.6.20-1-686] bridging + vlans + vconfig rem == stuck kernel

2007-05-12 Thread Kyle Moffett

On May 11, 2007, at 01:49:27, Kyle Moffett wrote:

On May 10, 2007, at 00:34:11, Kyle Moffett wrote:

On May 10, 2007, at 00:25:54, Ben Greear wrote:
Looks like a deadlock in the vlan code.  Any chance you can run  
this test with lockdep enabled?


You could also add a printk in vlan_device_event() to check which  
event it is hanging on, and the netdevice that is passed in.


Ok, I'll try building a 2.6.21 kernel with lockdep and some  
debugging printk()s in the vlan_device_event() function and get  
back to you tomorrow.  Thanks for the quick response!


[snip]

ifup -a brings up the interfaces in this order (See previous email  
for configuration details):

lo net0 wfi0 world0 lan lan:0 world

ifdown -a appears to bring them down in the same order (at least,  
until it gets stuck).


Hmm, turns out that it always hung downing this entry in my  
interfaces file, independent of ordering:


iface world0 inet manual
	mac-address 8b:8d:cb:91:e2:4c
	minimally-up yes
	vlan-dev net0
	vlan-id 4094

By commenting out the MAC address line it worked.  Yes, I realize the
MAC address specified there is bogus; I managed to {think,type}o that
one somehow.  I had been intending to specify a locally-allocated
virtual MAC address on world0 but instead I managed to somehow assign
one with the MAC multicast bit set (01:00:00:00:00:00).  If I change
the above garbage MAC to 02:00:00:00:00:01 (the first 02 is the
locally-administered bit) then it seems to work perfectly fine.  My
guess is that the bridging code doesn't properly drop all references
to world0 when it has that garbage MAC address on it (since the
problem only shows up when both the invalid mac-address is present
*AND* I start the world bridge).  I suppose this isn't really a big
problem, but it would be nice if things didn't leak refcounts on
invalid input.
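
(For reference, the test the kernel applies is just the low bit of
the first octet; roughly what is_multicast_ether_addr() in
include/linux/etherdevice.h does:

	/* A MAC whose first octet has the low bit set is a group
	 * (multicast) address and is invalid as a device address;
	 * 8b:... has that bit set, 02:... does not. */
	static inline int is_multicast_ether_addr(const u8 *addr)
	{
		return addr[0] & 0x01;
	}

so the bogus 8b:8d:cb:... address trips exactly this check.)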


Cheers,
Kyle Moffett



Re: [BUG][debian-2.6.20-1-686] bridging + vlans + vconfig rem == stuck kernel

2007-05-10 Thread Kyle Moffett

On May 10, 2007, at 00:34:11, Kyle Moffett wrote:

On May 10, 2007, at 00:25:54, Ben Greear wrote:
Looks like a deadlock in the vlan code.  Any chance you can run  
this test with lockdep enabled?


You could also add a printk in vlan_device_event() to check which  
event it is hanging on, and the netdevice that is passed in.


Ok, I'll try building a 2.6.21 kernel with lockdep and some  
debugging printk()s in the vlan_device_event() function and get  
back to you tomorrow.  Thanks for the quick response!


Progress!!!  I built a 2.6.21.1 kernel with a 1MB dmesg buffer,  
almost all of the locking debugging options on (as well as a few  
others just for kicks), a VLAN debug #define turned on in the
net/8021q/vlan.h file, and lots of extra debugging messages added to the
functions in vlan.c.  My initial interpretation is that due to the  
funny order in which ifdown -a takes down interfaces, it tries to  
delete the VLAN interfaces before the bridges running atop them have  
been taken down.  Ordinarily this seems to work, but when the  
underlying physical ethernet is down already, the last VLAN to be  
deleted seems to hang somehow.  The full results are as follows:


The lock dependency validator at startup passes all 218 testcases,  
indicating that all the locking crap is probably working correctly  
(those debug options chew up another meg of RAM).


ifup -a brings up the interfaces in this order (See previous email  
for configuration details):

lo net0 wfi0 world0 lan lan:0 world

ifdown -a appears to bring them down in the same order (at least,  
until it gets stuck).


Attached below is filtered debugging information.  I cut out 90% of  
the crap in the syslog, but there's still a lot left over to sift  
through; sorry.  If you want my .config or the full text of the log  
then email me privately and I'll send it to you, as it's kinda big.


I appreciate any advice, thanks for all your help

Cheers,
Kyle Moffett

This first bit is the ifup -a -v -i interfaces:

ADDRCONF(NETDEV_UP): net0: link is not ready
vlan_ioctl_handler: args.cmd: 6
vlan_ioctl_handler: args.cmd: 0
register_vlan_device: if_name -:net0:-^Ivid: 2
About to allocate name, vlan_name_type: 3
Allocated new name -:net0.2:-
About to go find the group for idx: 2
vlan_transfer_operstate: net0 state transition applies to net0.2 too:
vlan_proc_add, device -:net0.2:- being added.
Allocated new device successfully, returning.
wfi0: add 33:33:00:00:00:01 mcast address to master interface
wfi0: add 01:00:5e:00:00:01 mcast address to master interface
ADDRCONF(NETDEV_UP): wfi0: link is not ready
vlan_ioctl_handler: args.cmd: 6
vlan_ioctl_handler: args.cmd: 0
register_vlan_device: if_name -:net0:-^Ivid: 4094
About to allocate name, vlan_name_type: 3
Allocated new name -:net0.4094:-
About to go find the group for idx: 2
vlan_transfer_operstate: net0 state transition applies to net0.4094 too:

vlan_proc_add, device -:net0.4094:- being added.
Allocated new device successfully, returning.
world0: add 33:33:00:00:00:01 mcast address to master interface
world0: add 01:00:5e:00:00:01 mcast address to master interface
ADDRCONF(NETDEV_UP): world0: link is not ready
tg3: net0: Link is up at 1000 Mbps, full duplex.
tg3: net0: Flow control is on for TX and on for RX.
ADDRCONF(NETDEV_CHANGE): net0: link becomes ready
Propagating NETDEV_CHANGE for device net0...
... to wfi0
vlan_transfer_operstate: net0 state transition applies to wfi0 too:

...found a carrier, applying to VLAN device
... to world0
vlan_transfer_operstate: net0 state transition applies to world0 too:

...found a carrier, applying to VLAN device
lan: port 1(net0) entering listening state
ADDRCONF(NETDEV_CHANGE): wfi0: link becomes ready
wfi0: dev_set_promiscuity(master, 1)
wfi0: add 33:33:ff:5f:60:92 mcast address to master interface
lan: port 2(wfi0) entering listening state
ADDRCONF(NETDEV_CHANGE): world0: link becomes ready
world0: add 33:33:ff:91:e2:4c mcast address to master interface
lan: no IPv6 routers present
world: no IPv6 routers present
net0: no IPv6 routers present
world0: no IPv6 routers present
wfi0: no IPv6 routers present
lan: port 1(net0) entering learning state
lan: port 2(wfi0) entering learning state
lan: topology change detected, propagating
lan: port 1(net0) entering forwarding state
lan: topology change detected, propagating
lan: port 2(wfi0) entering forwarding state


This bit is for ifdown -a -v -i interfaces:

Propagating NETDEV_DOWN for device net0...
... to wfi0
wfi0: del 33:33:ff:5f:60:92 mcast address from vlan interface
wfi0: del 33:33:ff:5f:60:92 mcast address from master interface
wfi0: del 01:00:5e:00:00:01 mcast address from vlan interface
wfi0: del 01:00:5e:00:00:01 mcast address from master interface
wfi0: del 33:33:00:00:00:01 mcast address from vlan interface
wfi0: del 33:33:00:00:00:01 mcast address from master interface
lan: port 2(wfi0) entering disabled state
... to world0
world0

[BUG][debian-2.6.20-1-686] bridging + vlans + vconfig rem == stuck kernel

2007-05-09 Thread Kyle Moffett
 script in charge  
of disassembling VLAN interfaces.  There is an equivalent zz-km- 
bridge script for bridge interfaces, as well as if-pre-up.d scripts  
called 00-km-vlan and 00-km-bridge to create the interfaces.


If anyone has any suggestions, patches, or debugging tips I'm very  
interested to hear from you.  Thanks!


Cheers,
Kyle Moffett


Re: [BUG][debian-2.6.20-1-686] bridging + vlans + vconfig rem == stuck kernel

2007-05-09 Thread Kyle Moffett

On May 10, 2007, at 00:25:54, Ben Greear wrote:

Kyle Moffett wrote:
vconfig   D 83CCD8CE 0 16564  16562 (NOTLB)
   efdd7e7c 0086 ee120afb 83ccd8ce 98f00788 b7083ffa  
5384b49a c76c0b05
   9ebaf791 0004 efdd7e4e 0007 f1468a90 2ab74174  
0362 0326
   f1468b9c c180e420 0001 0286 c012933c efdd7e8c  
df98a000 c180e468

Call Trace:
 [<c012933c>] lock_timer_base+0x15/0x2f
 [<c0129445>] __mod_timer+0x91/0x9b
 [<c02988f5>] schedule_timeout+0x70/0x8d
 [<f8b75209>] vlan_device_event+0x13/0xf8 [8021q]


Looks like a deadlock in the vlan code.  Any chance you can run  
this test with lockdep enabled?


You could also add a printk in vlan_device_event() to check which  
event it is hanging on, and the netdevice that is passed in.


Ok, I'll try building a 2.6.21 kernel with lockdep and some debugging  
printk()s in the vlan_device_event() function and get back to you  
tomorrow.  Thanks for the quick response!


Since the vlan code holds RTNL at this point, then most other  
network tasks will eventually hang as well.


Well, it's less of an "eventually" and more of an "almost
immediately".  When that happens, pretty close to everything more
complicated than socket(), bind(), and connect() with straightforward
UNIX or INET sockets tends to stick completely.


Thanks again!

Cheers,
Kyle Moffett



Re: [PATCH 0/5] [RFC] AF_RXRPC socket family implementation [try #2]

2007-03-17 Thread Kyle Moffett

On Mar 16, 2007, at 10:11:41, Alan Cox wrote:
I know what they are; and I don't think that what's available  
covers it.



and use a proper standard socket type.


Assuming that that list is exhaustive...


SOCK_RDM seems to match perfectly well here. The point isn't to  
enumerate everything in the universe the point is to find works  
like parallels good enough to avoid more special casing.


IMHO the problem with classifying RxRPC as a reliable datagram  
socket is that even an atomic unidirectional communication isn't a  
single datagram, it's at least 3; there is shared connection state  
and security context on both sides which pertains to a collection of  
independent and possibly simultaneous RxRPC calls.  From the digging  
around that I did in the kernel socket code a while ago I don't see a  
cleaner way of implementing it than a new SOCK_RXRPC.


Cheers,
Kyle Moffett



Re: [2.6 patch] the scheduled removal of the frame diverter

2006-11-13 Thread Kyle Moffett

On Nov 13, 2006, at 16:04:25, Adrian Bunk wrote:

This patch contains the scheduled removal of the frame diverter.

[snip]

-config NET_DIVERT
-   bool Frame Diverter (EXPERIMENTAL)
-   depends on EXPERIMENTAL  BROKEN
-   ---help---
- The Frame Diverter allows you to divert packets from the
- network, that are not aimed at the interface receiving it (in
- promisc. mode). Typically, a Linux box setup as an Ethernet bridge
- with the Frames Diverter on, can do some *really* transparent www
- caching using a Squid proxy for example.


From my understanding of iptables/ebtables, identical functionality
is already available within that framework; as such this patch is
just removing broken, experimental, redundant code.  The iptables
code also properly handles IPv6 and all the other old warts of the
frame diverter as well.  I agree that this should go.


Cheers,
Kyle Moffett


Re: wireless: recap of current issues (configuration)

2006-01-17 Thread Kyle Moffett

On Jan 17, 2006, at 13:41, Stuffed Crust wrote:

On Mon, Jan 16, 2006 at 10:24:41PM +, Alan Cox wrote:
If I have told my equipment to obey UK law I expect it to do so.  
If I hop on the train to France and forget to revise my  
configuration I'd prefer it also believed the APs


It's not that you might forget to revise your configuration, but  
that the vast majority of users will not revise anything, and still  
expect things to just work.  Kind of like multi-band cell phones.


Alan's point is still very valid.  From a poweruser point of view, if
I specifically tell my wireless client "You must obey US laws", and
then I wander over past a broken imported AP, I don't want my client
to _expand_ its allowable range.  IMHO, userspace should be able to
forcibly restrict wireless frequencies to a certain regdomain (or
leave them unrestricted and passive-scan-only), and specify how
AP-configured regdomains act.  Given the range of possibilities, I
think that a userspace daemon monitoring events and dynamically
configuring the useable frequencies would be best.  That way the
userspace daemon could be configured to ignore APs, union/intersect
the APs with the configured regdomain, ignore the configured
regdomain in the presence of APs, etc.


Cheers,
Kyle Moffett

--
I lost interest in blade servers when I found they didn't throw  
knives at people who weren't supposed to be in your machine room.

  -- Anthony de Boer




[RFC] Fine-grained memory priorities and PI

2005-12-15 Thread Kyle Moffett

On Dec 15, 2005, at 03:21, David S. Miller wrote:
Not when we run out, but rather when we reach some low water mark,  
the critical sockets would still use GFP_ATOMIC memory but only  
critical sockets would be allowed to do so.


But even this has faults, consider the IPSEC scenario I mentioned,
and this applies to any kind of encapsulation actually, even simple  
tunneling examples can be concocted which make the critical  
socket idea fail.


The knee jerk reaction is mark IPSEC's sockets critical, and mark  
the tunneling allocations critical, and... and...  well you have  
GFP_ATOMIC then my friend.


In short, these separate page pool and critical socket ideas do
not work and we need a different solution; I'm sorry folks spent so
much time on them, but they are heavily flawed.


What we really need in the kernel is a more fine-grained memory  
priority system with PI, similar in concept to what's being done to  
the scheduler in some of the RT patchsets.  Currently we have a very  
black-and-white memory subsystem; when we go OOM, we just start  
killing processes until we are no longer OOM.  Perhaps we should have  
some way to pass memory allocation priorities throughout the kernel,
including "this request has X priority", "this request will help
free up X pages of RAM", and "drop while dirty under certain OOM to
free X memory using this method".


The initial benefit would be that OOM handling would become more  
reliable and less of a special case.  When we start to run low on  
free pages, it might be OK to kill the [EMAIL PROTECTED] process long before  
we OOM if such action might prevent the OOM.  Likewise, you might be  
able to flag certain file pages as being less critical, such that  
the kernel can kill a process and drop its dirty pages for files in
/tmp.  Or the kernel might do a variety of other things just by
failing new allocations with low priority and forcing existing  
allocations with low priority to go away using preregistered handlers.
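
To make the idea concrete, a purely hypothetical registration API;
none of these names exist in any kernel:

	struct mem_priority {
		int		prio;	/* higher = reclaim/kill later */
		/* Invoked under memory pressure: try to give back
		 * nr_wanted pages (possibly lossily, e.g. by dropping
		 * dirty /tmp pages and killing their owner); returns
		 * the number of pages actually freed. */
		unsigned long	(*reclaim)(struct mem_priority *mp,
					   unsigned long nr_wanted);
	};

	int  register_mem_priority(struct mem_priority *mp);
	void unregister_mem_priority(struct mem_priority *mp);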


When processes request memory through any subsystem, their memory  
priority would be passed through the kernel layers to the allocator,  
along with any associated information about how to free the memory in  
a low-memory condition.  As a result, I could configure my database  
to have a much higher priority than [EMAIL PROTECTED] (or boinc or whatever),  
so that when the database server wants to fill memory with clean DB  
cache pages, the kernel will kill [EMAIL PROTECTED] for its memory, even if
we could just leave some DB cache pages unfaulted.


Questions? Comments? This is a terrible idea that should never have  
seen the light of day? Both constructive and destructive criticism  
welcomed! (Just please keep the language clean! :-D)


Cheers,
Kyle Moffett

--
Q: Why do programmers confuse Halloween and Christmas?
A: Because OCT 31 == DEC 25.





Re: [RFC] Fine-grained memory priorities and PI

2005-12-15 Thread Kyle Moffett

On Dec 15, 2005, at 07:45, Con Kolivas wrote:
I have some basic process-that-called-the-memory-allocator link in
the -ck tree already which alters how aggressively memory is
reclaimed according to priority.  It does not affect out-of-memory
management, but that could be added to said algorithm; however I
don't see much point at the moment since OOM is still an uncommon
condition while regular memory allocation is routine.


My thought would be to generalize the two special cases of writeback  
of dirty pages or dropping of clean pages under memory pressure and  
OOM to be the same general case.  When you are trying to free up  
pages, it may be permissible to drop dirty mbox pages and kill the  
postfix process writing them in order to satisfy allocations for the  
mission-critical database server.  (Or maybe it's the other way  
around).  If a large chunk of the allocated pages have priorities and  
lossless/lossy free functions, then the kernel can be much more  
flexible and configurable about what to do when running low on RAM.


Cheers,
Kyle Moffett

--
I lost interest in blade servers when I found they didn't throw  
knives at people who weren't supposed to be in your machine room.

  -- Anthony de Boer

