date:20071010

Re: [PATCH] ipv4: kernel panic when only one unsecured port available

2007-10-10 Thread Anton Arapov

Hi,

"Denis V. Lunev" <[EMAIL PROTECTED]> writes:
> This code is broken from the very beginning.
>
> iris den # cat /proc/sys/net/ipv4/ip_local_port_range
> 32768   61000
> iris den # echo 32768 32 >/proc/sys/net/ipv4/ip_local_port_range
> iris den # cat /proc/sys/net/ipv4/ip_local_port_range
> 32768   32
> iris den # echo 32768 61000 >/proc/sys/net/ipv4/ip_local_port_range

  If you're talking about checks in sysctl, I believe it should be
another patch for sysctl only, and I'm going to push it via -mm tree.

  The division by zero exists in inet_connection_sock.c and must be
fixed for sure, because setting the same min and max port numbers in
sysctl is possible and not prohibited.

Cheers!
-- 
Anton Arapov, <[EMAIL PROTECTED]>
GPG Key ID: 0x6FA8C812


Re: [RFC/PATCH 2/4] UDP memory usage accounting (take 4): accounting unit and variable

2007-10-10 Thread Satoshi OSHIMA

Hi Evgeniy,

Thank you for your comment.

> Hi.
> 
> On Sat, Oct 06, 2007 at 12:01:07AM +0900, Satoshi OSHIMA ([EMAIL PROTECTED]) 
> wrote:
>> --- 2.6.23-rc3-udp_limit.orig/net/ipv4/udp.c
>> +++ 2.6.23-rc3-udp_limit/net/ipv4/udp.c
>> @@ -113,6 +113,10 @@ DEFINE_SNMP_STAT(struct udp_mib, udp_sta
>>  struct hlist_head udp_hash[UDP_HTABLE_SIZE];
>>  DEFINE_RWLOCK(udp_hash_lock);
>>  
>> +atomic_t udp_memory_allocated;
>> +
>> +EXPORT_SYMBOL(udp_memory_allocated);
>> +
> 
> Why do you export this variable?
> It is not accessed from modules in your patchset.

Good point! I'll fix it.

Satoshi Oshima
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ipv4: kernel panic when only one unsecured port available

2007-10-10 Thread Anton Arapov

"Denis V. Lunev" <[EMAIL PROTECTED]> writes:
> Anton Arapov wrote: 
>> "Denis V. Lunev" <[EMAIL PROTECTED]> writes:
>>> This code is broken from the very beginning.
>>>
>>> iris den # cat /proc/sys/net/ipv4/ip_local_port_range
>>> 32768   61000
>>> iris den # echo 32768 32 >/proc/sys/net/ipv4/ip_local_port_range
>>> iris den # cat /proc/sys/net/ipv4/ip_local_port_range
>>> 32768   32
>>> iris den # echo 32768 61000 >/proc/sys/net/ipv4/ip_local_port_range
>> 
>>   If you're talking about checks in sysctl, I believe it should be
>> another patch for sysctl only, and I'm going to push it via -mm tree.
>> 
>>   The division by zero exists in inet_connection_sock.c and must be
>> fixed for sure, because setting the same min and max port numbers in
>> sysctl is possible and not prohibited.
>> 
>> Cheers!
>
> Your patch changes nothing, unfortunately :( If I set '32768 32767' it
> will oops again.

  Patch prevents the system crash. System traps on division by zero.

  Your case(MAX
Kernel Development, Red Hat
GPG Key ID: 0x6FA8C812



Re: [PATCH] ipv4: kernel panic when only one unsecured port available

2007-10-10 Thread Denis V. Lunev

Anton Arapov wrote:
> "Denis V. Lunev" <[EMAIL PROTECTED]> writes:
>> Anton Arapov wrote: 
>>> "Denis V. Lunev" <[EMAIL PROTECTED]> writes:
>>>> This code is broken from the very beginning.
>>>>
>>>> iris den # cat /proc/sys/net/ipv4/ip_local_port_range
>>>> 32768   61000
>>>> iris den # echo 32768 32 >/proc/sys/net/ipv4/ip_local_port_range
>>>> iris den # cat /proc/sys/net/ipv4/ip_local_port_range
>>>> 32768   32
>>>> iris den # echo 32768 61000 >/proc/sys/net/ipv4/ip_local_port_range
>>>   If you're talking about checks in sysctl, I believe it should be
>>> another patch for sysctl only, and I'm going to push it via -mm tree.
>>>
>>>   The division by zero exists in inet_connection_sock.c and must be
>>> fixed for sure, because setting the same min and max port numbers in
>>> sysctl is possible and not prohibited.
>>>
>>> Cheers!
>> Your patch changes nothing, unfortunately :( If I set '32768 32767' it
>> will oops again.
> 
>   Patch prevents the system crash. System traps on division by zero.
> 
>   Your case(MAX that I have to join patch for sysctl.c to this one? It's bad idea.
> 

Both versions of the settings, yours and mine, are _useless_ in real
life. So we are doing sanity fixes here, am I right? If so, we must
prevent all versions of the OOPS (i.e. division by zero here).

I'll send my vision in a moment...

Regards,
Den

Re: [PATCH] ipv4: kernel panic when only one unsecured port available

2007-10-10 Thread David Miller

From: "Denis V. Lunev" <[EMAIL PROTECTED]>
Date: Wed, 10 Oct 2007 12:38:37 +0400

> Both versions of the settings, yours and mine, are _useless_ in real
> life. So we are doing sanity fixes here, am I right? If so, we must
> prevent all versions of the OOPS (i.e. division by zero here).
> 
> I'll send my vision in a moment...

I agree with Denis that we should plug all of the holes when fixing
this.

Re: [PATCH] ipv4: kernel panic when only one unsecured port available

2007-10-10 Thread Denis V. Lunev

Anton Arapov wrote:
> Hi,
> 
> "Denis V. Lunev" <[EMAIL PROTECTED]> writes:
>> This code is broken from the very beginning.
>>
>> iris den # cat /proc/sys/net/ipv4/ip_local_port_range
>> 32768   61000
>> iris den # echo 32768 32 >/proc/sys/net/ipv4/ip_local_port_range
>> iris den # cat /proc/sys/net/ipv4/ip_local_port_range
>> 32768   32
>> iris den # echo 32768 61000 >/proc/sys/net/ipv4/ip_local_port_range
> 
>   If you're talking about checks in sysctl, I believe it should be
> another patch for sysctl only, and I'm going to push it via -mm tree.
> 
>   The division by zero exists in inet_connection_sock.c and must be
> fixed for sure, because setting the same min and max port numbers in
> sysctl is possible and not prohibited.
> 
> Cheers!

Your patch changes nothing, unfortunately :( If I set '32768 32767' it
will oops again.

Regards,
Den

[PATCH] division-by-zero in inet_csk_get_port

2007-10-10 Thread Denis V. Lunev

This patch fixes a possible division by zero in inet_csk_get_port by
treating the situation low > high as if low == high.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
CC: Anton Arapov <[EMAIL PROTECTED]>

--- ./net/ipv4/inet_connection_sock.c.getport	2007-10-09 15:16:02.0 +0400
+++ ./net/ipv4/inet_connection_sock.c	2007-10-10 12:44:04.0 +0400
@@ -80,7 +80,14 @@ int inet_csk_get_port(struct inet_hashin
 		int low = sysctl_local_port_range[0];
 		int high = sysctl_local_port_range[1];
 		int remaining = (high - low) + 1;
-		int rover = net_random() % (high - low) + low;
+		int rover;
+
+		/* Treat low > high as high == low */
+		if (remaining <= 1) {
+			remaining = 1;
+			rover = low;
+		} else
+			rover = net_random() % (high - low) + low;
 
 		do {
 			head = &hashinfo->bhash[inet_bhashfn(rover, hashinfo->bhash_size)];
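
For readers following along, the guard in this patch can be modelled in
user-space C. This is a minimal sketch and not the kernel code:
pick_rover is a hypothetical name, the random value is passed in by the
caller in place of net_random(), and the divisor is kept as (high - low)
exactly as in the patch.

```c
/* Hypothetical user-space model of the rover selection in the patch.
 * rnd stands in for net_random(); *remaining receives the number of
 * ports the caller should try. */
int pick_rover(int low, int high, unsigned int rnd, int *remaining)
{
	int span = (high - low) + 1;

	/* Treat low > high as high == low: one port, no modulo at all. */
	if (span <= 1) {
		*remaining = 1;
		return low;
	}

	/* span >= 2 here, so the divisor (high - low) is at least 1. */
	*remaining = span;
	return (int)(rnd % (unsigned int)(high - low)) + low;
}
```

With a range of '32768 32' or '32768 32767' the function returns 32768
and reports one remaining port, so no modulo is ever evaluated.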

Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-10 Thread Pavel Emelyanov

Eric W. Biederman wrote:
> Pavel Emelyanov <[EMAIL PROTECTED]> writes:
> 
>> Currently indexes for netdevices come sequentially one by
>> one, and the same stays true even for devices that are 
>> created for namespaces.
>>
>> Side effects of this are:
>>  * the lo device does not have index 1 in a namespace. This may
>>    break some userspace that relies on it (and AFAIR something
>>    really broke in OpenVZ VEs without this);
> 
> As it happens lo hasn't been registered first for some time
> so it hasn't had ifindex of 1 in the normal kernel.
> 
>>  * after some time namespaces will have devices with indexes
>>    like 100 or similar. This might be confusing for a
>>    human (tools will not mind).
> 
> Only if we wind up creating that many devices.

Nope. Create and destroy new net ns for 1 times and you'll get it.

>> So move the (currently "global" and static) ifindex variable
>> on the struct net, making the indexes allocation look more
>> like on a standalone machine.
>>
>> Moreover - when we have indexes intersect between namespaces,
>> we may catch more BUGs in the future related to "wrong device 
>> was found for a given index".
> 
> Not yet.
> 
> I know there are several data structures internal to the kernel that
> are indexed by ifindex, and not struct net_device *.  There is the
> iflink field in struct net_device.  We need a way to refer to network
> devices in other namespaces in rtnetlink in an unambiguous way.   I
> don't see any real problems with a global ifindex assignment until
> we start migrating applications.
> 
> So please hold off on this until the kernel has been audited and
> we have removed all of the uses of ifindex that assume ifindex is
> global, that we can find.

Ok.

> Right now a namespace local ifindex seems to be just asking for
> trouble.

You said the same about caching the global pid on the task_struct,
but looks like you were wrong ;) Just kidding.

> Eric
> 
> 


PROBLEM: skb_clone SMP race?

2007-10-10 Thread Santiago Font Arquer

Hello,
   I'm studying the implementation of sk_buff and I think there's a
possible race condition in skb_clone (2.6.22.9)
   The code is:


struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
{
	struct sk_buff *n;

	n = skb + 1;
	if (skb->fclone == SKB_FCLONE_ORIG &&
	    n->fclone == SKB_FCLONE_UNAVAILABLE) {
		atomic_t *fclone_ref = (atomic_t *) (n + 1);
		n->fclone = SKB_FCLONE_CLONE;
		atomic_inc(fclone_ref);
	} else {
		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
		if (!n)
			return NULL;
		n->fclone = SKB_FCLONE_UNAVAILABLE;
	}

   If an skb with a fast clone available (first "if" true) is
referenced from different CPUs (skb->users > 1) (I cannot find explicit
checks that make this impossible) and skb_clone is called
simultaneously on that skb, both callers can get the same clone (the
"fast" clone), and several problems follow: a wrong "clone_skb->users"
(1, as the caller expects, when it should really be 2), fclone_ref
set to 3 causing further problems, ...

  IMO, the same problem arises even when the calls to skb_clone are
not simultaneous: there isn't a memory barrier after the change of
"n->fclone" to guarantee that the change is visible to other CPUs
(a barrier alone would not solve anything, though; I mention this only
as another reason I see for the race to happen).

  Is that correct? Thank you in advance.
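
  The check-then-set pattern described above can be contrasted with an
atomic claim in a small user-space model. This is only a sketch under
assumptions: the enum values and try_claim_fclone are hypothetical
names, C11 atomics stand in for kernel primitives, and nothing here
says whether the kernel needs such a change or whether callers already
serialize access to the skb.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the fclone states quoted above. */
enum { FCLONE_UNAVAILABLE = 0, FCLONE_CLONE = 1 };

/* Claim the companion clone only if its slot is still marked
 * unavailable (i.e. not handed out yet). The compare-and-swap makes
 * the test and the update one atomic step, so two concurrent callers
 * cannot both succeed. */
bool try_claim_fclone(atomic_int *state)
{
	int expected = FCLONE_UNAVAILABLE;

	return atomic_compare_exchange_strong(state, &expected, FCLONE_CLONE);
}
```

  In this model the first caller gets true and every later caller gets
false, at which point it would fall back to a separate allocation.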


   Santiago Font Arquer

Re: [PATCH] division-by-zero in inet_csk_get_port

2007-10-10 Thread Anton Arapov


Ok, I've got it, so we have to do the same with the following code,
quoted from inet_hashtables.c and inet6_hashtables.c. I'll prepare the
patch.

And just out of curiosity, will the /* Treat low > high as high == low */
idea be kept after the sysctl is patched?

int inet_hash_connect(struct inet_timewait_death_row *death_row,
		      struct sock *sk)
{
	struct inet_hashinfo *hinfo = death_row->hashinfo;
	const unsigned short snum = inet_sk(sk)->num;
	struct inet_bind_hashbucket *head;
	struct inet_bind_bucket *tb;
	int ret;

	if (!snum) {
		int low = sysctl_local_port_range[0];
		int high = sysctl_local_port_range[1];
>		int range = high - low;
		int i;
		int port;
		static u32 hint;
		u32 offset = hint + inet_sk_port_offset(sk);
		struct hlist_node *node;
		struct inet_timewait_sock *tw = NULL;

		local_bh_disable();
		for (i = 1; i <= range; i++) {
>			port = low + (i + offset) % range;
 

"Denis V. Lunev" <[EMAIL PROTECTED]> writes:
> This patch fixes a possible division by zero in inet_csk_get_port by
> treating the situation low > high as if low == high.
>
> Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
> CC: Anton Arapov <[EMAIL PROTECTED]>
>
> --- ./net/ipv4/inet_connection_sock.c.getport	2007-10-09 15:16:02.0 +0400
> +++ ./net/ipv4/inet_connection_sock.c	2007-10-10 12:44:04.0 +0400
> @@ -80,7 +80,14 @@ int inet_csk_get_port(struct inet_hashin
>  		int low = sysctl_local_port_range[0];
>  		int high = sysctl_local_port_range[1];
>  		int remaining = (high - low) + 1;
> -		int rover = net_random() % (high - low) + low;
> +		int rover;
> +
> +		/* Treat low > high as high == low */
> +		if (remaining <= 1) {
> +			remaining = 1;
> +			rover = low;
> +		} else
> +			rover = net_random() % (high - low) + low;
>  
>  		do {
>  			head = &hashinfo->bhash[inet_bhashfn(rover, hashinfo->bhash_size)];

-- 
Anton Arapov, <[EMAIL PROTECTED]>
Kernel Development, Red Hat
GPG Key ID: 0x6FA8C812



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Andi Kleen

> A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you

With TSO really? 

> increase the size much more performance starts to go down due to L2
> cache thrashing.

Another possibility would be to consider using cache avoidance
instructions while updating the TX ring (e.g. write combining 
on x86) 

-Andi


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller

From: Andi Kleen <[EMAIL PROTECTED]>
Date: Wed, 10 Oct 2007 11:16:44 +0200

> > A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
> 
> With TSO really? 

Yes.

> > increase the size much more performance starts to go down due to L2
> > cache thrashing.
> 
> Another possibility would be to consider using cache avoidance
> instructions while updating the TX ring (e.g. write combining 
> on x86) 

The chip I was working with at the time (UltraSPARC-IIi) compressed
all the linear stores into 64-byte full cacheline transactions via
the store buffer.

It's true that it would allocate in the L2 cache on a miss, which
is different from your suggestion.

In fact, such a thing might not pan out well, because most of the time
you write a single descriptor or two, and that isn't a full cacheline,
which means a read/modify/write is the only coherent way to make such
a write to RAM.

Sure you could batch, but I'd rather give the chip work to do unless
I unequivocally knew I'd have enough pending to fill a cacheline's
worth of descriptors.  And since you suggest we shouldn't queue in
software... :-)

Re: [PATCH 1/6][NET-2.6.24] Introduce the seq_open_private()

2007-10-10 Thread David Miller

From: Pavel Emelyanov <[EMAIL PROTECTED]>
Date: Tue, 09 Oct 2007 19:52:58 +0400

> This function allocates a zeroed chunk of memory and
> calls seq_open(). The __seq_open_private() helper returns
> the allocated memory to make it possible for the caller
> to initialize it.
> 
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

Applied, nice cleanup Pavel.

Re: [PATCH 2/6][NET-2.6.24] Make core networking code use seq_open_private

2007-10-10 Thread David Miller

From: Pavel Emelyanov <[EMAIL PROTECTED]>
Date: Tue, 09 Oct 2007 19:55:28 +0400

> This concerns the ipv4 and ipv6 code mostly, but also the netlink
> and unix sockets.
> 
> The netlink code is an example of how to use the __seq_open_private()
> call - it saves the net namespace on this private.
> 
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

Applied.

Re: [PATCH 3/6][NET-2.6.24] Make netfilter code use the seq_open_private

2007-10-10 Thread David Miller

From: Pavel Emelyanov <[EMAIL PROTECTED]>
Date: Tue, 09 Oct 2007 19:57:29 +0400

> Just switch to the consolidated calls.
> 
> ipt_recent() has to initialize the private, so use
> the __seq_open_private() helper.
> 
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
> Cc: Patrick McHardy <[EMAIL PROTECTED]>

Applied.

Re: [PATCH 4/6][NET-2.6.24] Make decnet code use the seq_open_private()

2007-10-10 Thread David Miller

From: Pavel Emelyanov <[EMAIL PROTECTED]>
Date: Tue, 09 Oct 2007 19:59:38 +0400

> Just switch to the consolidated code.
> 
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
> Cc: Patrick Caulfield <[EMAIL PROTECTED]>

Applied.

Re: [PATCH 5/6][NET-2.6.24] Make the IRDA use the seq_open_private()

2007-10-10 Thread David Miller

From: Pavel Emelyanov <[EMAIL PROTECTED]>
Date: Tue, 09 Oct 2007 20:01:32 +0400

> Just switch to the consolidated code
> 
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
> Cc: Samuel Ortiz <[EMAIL PROTECTED]>

Applied.

Re: [PATCH] division-by-zero in inet_csk_get_port

2007-10-10 Thread David Miller

From: Anton Arapov <[EMAIL PROTECTED]>
Date: Wed, 10 Oct 2007 11:00:17 +0200

> Ok, I've got it, so we have to do the same with the following code,
> quoted from inet_hashtables.c and inet6_hashtables.c. I'll prepare the
> patch.
> 
> And just out of curiosity, will the /* Treat low > high as high == low */
> idea be kept after the sysctl is patched?

I'm beginning to think that we should do the sysctl validation
in this patch too, instead of duplicating this grotty check
in all of these port selection functions.
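
One way to read this suggestion is to validate the pair at the moment
it is written, so that the port selection functions never see
low > high. The following is a user-space sketch only, under stated
assumptions: set_local_port_range and local_port_range are hypothetical
stand-ins rather than the kernel's sysctl handler, and the 1..65535
bounds are an assumption of the sketch.

```c
#include <errno.h>

/* Stand-in for sysctl_local_port_range; defaults taken from the thread. */
int local_port_range[2] = { 32768, 61000 };

/* Accept a new range only if it describes at least one valid port.
 * Returns 0 on success and -EINVAL otherwise; on failure the old
 * range is left untouched. */
int set_local_port_range(int low, int high)
{
	if (low < 1 || high > 65535 || low > high)
		return -EINVAL;

	local_port_range[0] = low;
	local_port_range[1] = high;
	return 0;
}
```

In this model a write of '32768 32' is rejected, while '34000 34000'
is accepted as a range of exactly one port.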

Re: [RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases

2007-10-10 Thread David Miller

From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
Date: Tue,  9 Oct 2007 15:20:02 +0300

> This addition of lost_retrans_low to tcp_sock might be
> unnecessary, it's not clear how often lost_retrans worker is
> executed when there wasn't work to do.
> 
> Cc: TAKANO Ryousei <[EMAIL PROTECTED]>
> Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>

I wanted to apply this, but it doesn't go cleanly on top of
net-2.6.24, can you respin this patch for me?

Thanks!

Re: [RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases

2007-10-10 Thread David Miller

From: David Miller <[EMAIL PROTECTED]>
Date: Wed, 10 Oct 2007 02:44:03 -0700 (PDT)

> From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
> Date: Tue,  9 Oct 2007 15:20:02 +0300
> 
> > This addition of lost_retrans_low to tcp_sock might be
> > unnecessary, it's not clear how often lost_retrans worker is
> > executed when there wasn't work to do.
> > 
> > Cc: TAKANO Ryousei <[EMAIL PROTECTED]>
> > Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
> 
> I wanted to apply this, but it doesn't go cleanly on top of
> net-2.6.24, can you respin this patch for me?

Never mind, I misinterpreted the ordering of the 3 patches,
sorry...

Re: [PATCH] [TCP]: Separate lost_retrans loop into own function

2007-10-10 Thread David Miller

From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
Date: Tue,  9 Oct 2007 15:20:00 +0300

> Follows own function for each task principle, this is really
> somewhat separate task being done in sacktag. Also reduces
> indentation.
> 
> In addition, added ack_seq local var to break some long
> lines & fixed coding style things.
> 
> Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>

Applied, thanks Ilpo!

Re: [PATCH][NET-2.6.24] Remove double dev->flags checking when calling dev_close()

2007-10-10 Thread David Miller

From: Pavel Emelyanov <[EMAIL PROTECTED]>
Date: Tue, 09 Oct 2007 14:50:54 +0400

> The unregister_netdevice() and dev_change_net_namespace() 
> both check for dev->flags to be IFF_UP before calling the 
> dev_close(), but the dev_close() checks for IFF_UP itself, 
> so remove those unneeded checks.
> 
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

Applied, thanks Pavel.

Re: [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness

2007-10-10 Thread David Miller

From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
Date: Tue, 9 Oct 2007 16:03:29 +0300 (EEST)

> On Tue, 9 Oct 2007, Ilpo Järvinen wrote:
> 
> > Lost_retrans handling of sacktag was found to be flawed, two
> > problems that were found have an intertwined solution. Fastpath
> > problem has existed since hints got added and the other problem
> > has probably been there even longer than that. ...This change
> > may add non-trivial processing cost.
> > 
> > Initial sketch, only compile tested. This will become more and
> > more useful, when sacktag starts to process less and less skbs,
> > which hopefully happens quite soon... :-) Sadly enough it will
> > probably then be consuming part of the benefits we're able to
> > achieve by less skb walking...
> > 
> > First one is trivial, so Dave might want to apply it already.
> 
> Hmm, forgot to add -n to git-format-patch. Since it's currently
> RFC, I won't bother to resubmit with numbers unless somebody
> really wants that. Here's the correct ordering, if it's not
> obvious from the patches alone:

I'm going to leave the 2nd and 3rd patches alone for
now so they can cook a little bit longer.

Thanks!

Re: [PATCH 1/8][BNX2X] resubmit as attachments: add bnx2x to Kconfig and Makefile

2007-10-10 Thread David Miller

From: "Eliezer Tamir" <[EMAIL PROTECTED]>
Date: Tue, 09 Oct 2007 18:20:01 +0200

> Almost all of the zero-filled tables will be removed.
> The rest of the registers do need to be initialized.
> 
> I agree that the number of registers that needs to be initialized is
> huge, but that is caused by the way the hardware was designed.
> 
> The values for the initialization come from several sources:
> Some are derived from HW code (the XML files used to derive the verilog
> code),
> Others (along with much of the machine generated .h files) are generated
> at microcode build time, adding a microcode routine will cause the init
> values to change, using a new variable can cause an .h file to change.
> In the last group which is very small, are registers that are controlled
> by the driver.
> 
> The values in this file really are machine generated, they really are
> not meant to be modified directly by editing the file.
> 
> The registers that are under the driver's control are in the main .c
> and .h files.
...
> The idle check code is not a manufacturing test, it is meant to help
> debug the driver and microcode.
> If the driver sends an invalid command to one of the CPUs which then
> chokes on it, this will tell you which one of them died and the general
> whereabouts of the problem. (ingress CPU X is stuck because output queue
> Y is full)
 ...
> ( Michael has showed me the trick of how to post with evolution, so I
> hope that the mangled patch problem is behind us and I think that I can
> now post everything without a problem, Hallelujah!)

Thanks for the explanations, I look forward to your next
submission.

Re: [PATCH 6/6][NET-2.6.24] Make the sunrpc use the seq_open_private()

2007-10-10 Thread David Miller

From: Pavel Emelyanov <[EMAIL PROTECTED]>
Date: Tue, 09 Oct 2007 20:04:23 +0400

> Just switch to the consolidated code.
> 
> Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
> Cc: Neil Brown <[EMAIL PROTECTED]>

Applied.

Re: [PATCH] [IPV6] Defer IPv6 device initialization until a valid qdisc is specified

2007-10-10 Thread David Miller

From: Mitsuru Chinen <[EMAIL PROTECTED]>
Date: Tue, 9 Oct 2007 16:21:58 +0900

> To judge the timing for DAD, netif_carrier_ok() is used. However,
> there is a possibility that dev->qdisc stays noop_qdisc even if
> netif_carrier_ok() returns true. In that case, DAD NS is not sent out.
> We need to defer the IPv6 device initialization until a valid qdisc
> is specified.
> 
> Signed-off-by: Mitsuru Chinen <[EMAIL PROTECTED]>
> Signed-off-by: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>

Thanks for submitting the fix.

Although Herbert is right that this does not fix the problem
universally, it does make things better, so I will apply this
patch.

Thanks!

Re: [PATCH] division-by-zero in inet_csk_get_port

2007-10-10 Thread David Miller

From: Anton Arapov <[EMAIL PROTECTED]>
Date: Wed, 10 Oct 2007 11:56:23 +0200

>   Yep, that's exactly what I'm talking about. I'm sure that
>   [...] % (high - low) [...] has been erroneous from the beginning,
> because in such places we want to have 1 in the denominator for the
> case when we have only one port: 34000 34000 in sysctl's
> ip_local_port_range means 1 (one) port, not 0 (zero).
> 
>   So it seems to me that we have to fix mentioned denominators in
> kernel/net to have 1, that will be correct logically. And do the
> MAX   From this point of view, it's best idea to have two patches: one for
> the kernel/net denominators and another one for the sysctl.c's
> function dointvec_minmax(). Because they can live independently. And
> the patch for the kernel/net will do the work at least because we
> prevent kernel trap at all.
>   
>   Dave, am I right?

Sure, two patches is fine.
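
The denominator argument quoted above can be written out as a short
user-space sketch. The names port_count and nth_port are hypothetical,
and the sketch assumes low <= high has already been enforced by the
separate sysctl patch; it is not taken from either patch.

```c
/* Number of ports in the inclusive range [low, high].
 * Assumes low <= high has already been validated elsewhere. */
int port_count(int low, int high)
{
	return high - low + 1;
}

/* Map an arbitrary offset onto a port inside [low, high].
 * Dividing by port_count() rather than (high - low) keeps the divisor
 * at 1 or more and lets the result reach high itself. */
int nth_port(int low, int high, unsigned int offset)
{
	return low + (int)(offset % (unsigned int)port_count(low, high));
}
```

For '34000 34000' the count is 1 and every offset maps to port 34000,
which is the one-port case described in the quoted text.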

[PATCH] natsemi: Use round_jiffies() for slow timers

2007-10-10 Thread Mark Brown

Unless we have failed to fill the RX ring the timer used by the natsemi
driver is not particularly urgent and can use round_jiffies() to allow
grouping with other timers.

Signed-off-by: Mark Brown <[EMAIL PROTECTED]>
---
Rediffed against current netdev-2.6.git#upstream

 drivers/net/natsemi.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/natsemi.c b/drivers/net/natsemi.c
index 527f9dc..b881786 100644
--- a/drivers/net/natsemi.c
+++ b/drivers/net/natsemi.c
@@ -1576,7 +1576,7 @@ static int netdev_open(struct net_device *dev)
 
 	/* Set the timer to check for link beat. */
 	init_timer(&np->timer);
-	np->timer.expires = jiffies + NATSEMI_TIMER_FREQ;
+	np->timer.expires = round_jiffies(jiffies + NATSEMI_TIMER_FREQ);
 	np->timer.data = (unsigned long)dev;
 	np->timer.function = &netdev_timer; /* timer handler */
 	add_timer(&np->timer);
@@ -1856,7 +1856,11 @@ static void netdev_timer(unsigned long data)
 			next_tick = 1;
 		}
 	}
-	mod_timer(&np->timer, jiffies + next_tick);
+
+	if (next_tick > 1)
+		mod_timer(&np->timer, round_jiffies(jiffies + next_tick));
+	else
+		mod_timer(&np->timer, jiffies + next_tick);
 }
 
 static void dump_ring(struct net_device *dev)
@@ -3331,7 +3335,7 @@ static int natsemi_resume (struct pci_dev *pdev)
 		spin_unlock_irq(&np->lock);
 		enable_irq(dev->irq);
 
-		mod_timer(&np->timer, jiffies + 1*HZ);
+		mod_timer(&np->timer, round_jiffies(jiffies + 1*HZ));
 	}
 	netif_device_attach(dev);
 out:
-- 
1.5.3.4
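
The idea of letting slow timers share a wakeup can be illustrated with
a simplified user-space model. This is a sketch under assumptions, not
the kernel's round_jiffies(): model_round_ticks and MODEL_HZ are
hypothetical names, the tick rate of 250 is picked only for the
example, and anything the kernel helper does beyond rounding to a
whole second is left out.

```c
#define MODEL_HZ 250UL	/* assumed tick rate, for illustration only */

/* Round an expiry time (in ticks) to a whole-second boundary so that
 * timers with similar deadlines fire together. Values less than a
 * quarter second past a boundary round down, everything else rounds
 * up. If rounding would land at or before now, the input is returned
 * unchanged. */
unsigned long model_round_ticks(unsigned long expires, unsigned long now)
{
	unsigned long rem = expires % MODEL_HZ;
	unsigned long rounded;

	if (rem < MODEL_HZ / 4)
		rounded = expires - rem;
	else
		rounded = expires - rem + MODEL_HZ;

	return rounded <= now ? expires : rounded;
}
```

With MODEL_HZ at 250, an expiry of 1010 ticks rounds down to 1000 and
an expiry of 1100 rounds up to 1250. The error can be a sizeable
fraction of a second, which is presumably why the patch leaves the
urgent next_tick == 1 case unrounded.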


Re: [RFC PATCH net-2.6.24 0/5]: TCP sacktag cache usage recoded

2007-10-10 Thread David Miller

From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
Date: Mon, 24 Sep 2007 13:28:42 +0300

> After couple of wrong-wayed before/after()s and one infinite
> loopy version, here's the current trial version of a sacktag  
> cache usage recode
> 
> Two first patches come from tcp-2.6 (rebased and rotated).
> This series apply cleanly only on top of the other three patch
> series I posted earlier today. The last debug patch provides
> some statistics for those interested enough.
> 
> Dave, please DO NOT apply! ...Some thoughts could be nice
> though :-).

Ilpo, I have not forgotten about this patch set.

It is something I plan to look over after the madness of merging
net-2.6.24 to Linus is complete.

Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Herbert Xu

On Wed, Oct 10, 2007 at 11:16:44AM +0200, Andi Kleen wrote:
> > A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
> 
> With TSO really? 

Hardware queues are generally per-page rather than per-skb so
it'd fill up quicker than a software queue even with TSO.
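
As a rough back-of-the-envelope sketch of that point: the helper names
below are hypothetical, and the model of one descriptor for the header
plus one per page of payload is an assumption, since real drivers
differ in how they map fragments to descriptors.

```c
/* Descriptors consumed by one packet when the hardware ring takes one
 * entry per page fragment: one for the linear header plus one per
 * page of payload (rounded up). page_size must be non-zero. */
unsigned int descs_per_skb(unsigned int payload_bytes, unsigned int page_size)
{
	return 1 + (payload_bytes + page_size - 1) / page_size;
}

/* How many such packets fit in a ring of ring_entries descriptors. */
unsigned int skbs_per_ring(unsigned int ring_entries,
			   unsigned int payload_bytes, unsigned int page_size)
{
	return ring_entries / descs_per_skb(payload_bytes, page_size);
}
```

Under these assumptions a 64 KB TSO packet on 4096-byte pages takes 17
descriptors, so a 256-entry ring holds only 15 of them.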

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] division-by-zero in inet_csk_get_port

2007-10-10 Thread Anton Arapov

David Miller <[EMAIL PROTECTED]> writes:
>> Ok, I've got it, so we have to do the same with the following code,
>> quoted from inet_hashtables.c and inet6_hashtables.c. I'll prepare the
>> patch.
>> 
>> And just out of curiosity, will the /* Treat low > high as high == low */
>> idea be kept after the sysctl is patched?
>
> I'm beginning to think that we should do the sysctl validation
> in this patch too, instead of duplicating this grotty check
> in all of these port selection functions.
  Yep, that's exactly what I'm talking about. I'm sure that
  [...] % (high - low) [...] has been erroneous from the beginning, because
in such places we want to have 1 in the denominator for the case when we
have only one port: 34000 34000 in sysctl's ip_local_port_range means
1 (one) port, not 0 (zero).
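
The point about the denominator can be sketched as follows. This is a minimal illustration, not the kernel code: port_span() and pick_port() are hypothetical helper names, and rnd stands in for whatever random value the caller supplies. The range is treated as inclusive, and low > high is folded into high == low as in the comment quoted above.

```c
#include <stdint.h>

/* Number of ports in the inclusive range [low, high].
 * A misordered range (low > high) is treated as high == low.
 * Port numbers fit in 16 bits, so the +1 cannot overflow here. */
static uint32_t port_span(uint32_t low, uint32_t high)
{
    if (high < low)
        high = low;
    return (high - low) + 1;    /* 34000..34000 is one port, never zero */
}

/* Pick a port in [low, high] from a caller-supplied random value. */
static uint32_t pick_port(uint32_t low, uint32_t high, uint32_t rnd)
{
    return low + rnd % port_span(low, high);    /* divisor is always >= 1 */
}
```

With low == high the divisor is 1, so the modulo is always 0 and the single configured port is returned instead of trapping.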

  So it seems to me that we have to fix the mentioned denominators in
kernel/net to be 1 in that case, which is logically correct. And do the
MAX
Kernel Development, Red Hat
GPG Key ID: 0x6FA8C812



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Andi Kleen

On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote:
> The chip I was working with at the time (UltraSPARC-IIi) compressed
> all the linear stores into 64-byte full cacheline transactions via
> the store buffer.

That's a pretty old CPU. Conclusions on more modern ones might be different.

> In fact, such a thing might not pan out well, because most of the time
> you write a single descriptor or two, and that isn't a full cacheline,
> which means a read/modify/write is the only coherent way to make such
> a write to RAM.

x86 WC does R-M-W and is coherent of course. The main difference is 
just that the result is not cached.  When the hardware accesses the cache line
then the cache should be also invalidated.

> Sure you could batch, but I'd rather give the chip work to do unless
> I unequivocably knew I'd have enough pending to fill a cacheline's
> worth of descriptors.  And since you suggest we shouldn't queue in
> software... :-)

Hmm, it probably would need to be coupled with batched submission if
multiple packets are available, you're right. Probably not worth doing explicit
queueing though.

I suppose it would be an interesting experiment at least.

-Andi

Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller

From: Andi Kleen <[EMAIL PROTECTED]>
Date: Wed, 10 Oct 2007 12:23:31 +0200

> On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote:
> > The chip I was working with at the time (UltraSPARC-IIi) compressed
> > all the linear stores into 64-byte full cacheline transactions via
> > the store buffer.
> 
> That's a pretty old CPU. Conclusions on more modern ones might be different.

Cache matters, just scale the numbers.

> I suppose it would be an interesting experiment at least.

Absolutely.

I've always gotten very poor results when increasing the TX queue a
lot, for example with NIU the point of diminishing returns seems to
be in the range of 256-512 TX descriptor entries and this was with
1.6Ghz cpus.

Re: [RFC PATCH net-2.6.24 0/5]: TCP sacktag cache usage recoded

2007-10-10 Thread Ilpo Järvinen

On Wed, 10 Oct 2007, David Miller wrote:

> From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
> Date: Mon, 24 Sep 2007 13:28:42 +0300
> 
> > After couple of wrong-wayed before/after()s and one infinite
> > loopy version, here's the current trial version of a sacktag  
> > cache usage recode
> > 
> > Two first patches come from tcp-2.6 (rebased and rotated).
> > This series apply cleanly only on top of the other three patch
> > series I posted earlier today. The last debug patch provides
> > some statistics for those interested enough.
> > 
> > Dave, please DO NOT apply! ...Some thoughts could be nice
> > though :-).
> 
> Ilpo, I have not forgotten about this patch set.
> 
> It is something I plan to look over after the madness of merging
> net-2.6.24 to Linus is complete.

Thanks, there's probably going to be some trouble though. I'd bet it
no longer applies cleanly to net-2.6.23 HEAD because of something else
that got applied (I don't remember exactly, but I guess the highest_sack
reno fix did that).

I'll try to get them resent soon, but currently my thoughts are on solving
the DSACK-ignored bug (and doing the associated cleanups), which will again
cause those code-move conflicts to reoccur. Therefore I'd love to postpone
the rebase a bit... Hmm, the SACK code is in such flux currently that I'll
have to deal with conflicts almost daily due to overlapping ideas...

-- 
 i.

Re: [RFC PATCH net-2.6.24 0/5]: TCP sacktag cache usage recoded

2007-10-10 Thread David Miller

From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
Date: Wed, 10 Oct 2007 13:26:05 +0300 (EEST)

> Hmm, SACK code is under such flux currently that I'll 
> have to deal conflicts almost daily due to overlapping ideas...

Welcome to my world, just scale it to 800 patches and the entire
networking tree :-)


Re: Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread Patrick McHardy

Denys wrote:
>here is try to switch clocksource to acpi_pm
> 
> 
> Time: acpi_pm clocksource has been installed.
> Clockevents: could not switch to one-shot mode: lapic is not functional.
> Could not switch to high resolution mode on CPU 0

What does /proc/net/psched contain?

[re] Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread Denys

> What does /proc/net/psched contain?
visp-1 ~ # cat /proc/net/psched
03e8 0400 000f4240 3b9aca00


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


Re: [re] Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread Patrick McHardy

Denys wrote:
>>What does /proc/net/psched contain?
> 
> visp-1 ~ # cat /proc/net/psched
> 03e8 0400 000f4240 3b9aca00

OK, hrtimers are disabled on your system, but we still announce
the usec clock resolution to userspace, which is used by HTB to
calculate the burst rate. But actually that can't be the reason
since that has already been the case in 2.6.22. Please post a diff
of the bootlog from 2.6.22 and 2.6.23.
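
For reference, the four hex fields above can be decoded with a small sketch like the one below. The struct and function names are hypothetical; the meaning of the first, second and fourth fields follows the seq_printf() arguments in net/sched/sch_api.c (NSEC_PER_USEC, PSCHED_US2NS(1), and NSEC_PER_SEC divided by the clock resolution in ns), and the third field is left uninterpreted.

```c
#include <stdio.h>
#include <stdint.h>

struct psched_info {
    uint32_t ns_per_usec;   /* first field: NSEC_PER_USEC */
    uint32_t us2ns;         /* second field: PSCHED_US2NS(1) */
    uint32_t third;         /* third field, not interpreted here */
    uint32_t clock_hz;      /* fourth field: NSEC_PER_SEC / resolution in ns */
};

/* Parse one /proc/net/psched line; returns 0 on success, -1 if malformed. */
static int psched_parse(const char *line, struct psched_info *out)
{
    unsigned int a, b, c, d;

    if (sscanf(line, "%x %x %x %x", &a, &b, &c, &d) != 4)
        return -1;
    out->ns_per_usec = a;
    out->us2ns = b;
    out->third = c;
    out->clock_hz = d;
    return 0;
}

/* Announced clock resolution in nanoseconds (0 if the field is 0). */
static uint32_t psched_res_ns(const struct psched_info *p)
{
    return p->clock_hz ? 1000000000u / p->clock_hz : 0;
}
```

For the line posted above, the fourth field 3b9aca00 is 1000000000, i.e. an announced resolution of 1 ns.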


Re: [PATCH] division-by-zero in inet_csk_get_port

2007-10-10 Thread Anton Arapov

David Miller <[EMAIL PROTECTED]> writes:
> From: Anton Arapov <[EMAIL PROTECTED]>
> Date: Wed, 10 Oct 2007 11:56:23 +0200
>
>>   Yep, that's exactly what I'm talking about. I'm sure that
>>   [...] % (high - low) [...] has been erroneous from the beginning, because
>> in such places we want to have 1 in the denominator for the case when we
>> have only one port: 34000 34000 in sysctl's ip_local_port_range means
>> 1 (one) port, not 0 (zero).
>> 
>>   So it seems to me that we have to fix the mentioned denominators in
>> kernel/net to be 1 in that case, which is logically correct. And do the
>> MAX
>>   From this point of view, the best idea is to have two patches: one for
>> the kernel/net denominators and another one for the sysctl.c
>> function dointvec_minmax(), because they can live independently. And
>> the patch for kernel/net at least does the job of preventing the
>> kernel trap.
>>   
>>   Dave, am I right?
>
> Sure, two patches is fine.

  I was mistaken. We can't modify the sysctl code itself to do
checks like (MAX_VAL < MIN_VAL): the functions there are generic, and if we
want to implement something like this we would have to add entirely new
functionality, which would be insane. :)
  It seems to me that all we can do is make this check in the code where the
MAX_VAL
Kernel Development, Red Hat
GPG Key ID: 0x6FA8C812



Re: [Devel] [PATCH 1/5] net: Modify all rtnetlink methods to only work in the initial namespace

2007-10-10 Thread Denis V. Lunev

Eric W. Biederman wrote:
> Before I can enable rtnetlink to work in all network namespaces
> I need to be certain that something won't break.  So this
> patch deliberately disables all of the rtnetlink methods in everything
> except the initial network namespace.  After the methods have been
> audited this extra check can be disabled.
>
[...]
>  static int br_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
>  {
> + struct net *net = skb->sk->sk_net;
>   struct net_device *dev;
>   int idx;
>  

I read some code today while grepping for 'init_net.loopback_dev' and found
an issue that looks interesting and non-trivial to me.

The network namespace is extracted from the packet in two different ways in
TCP: from the socket on the outgoing path and from the device on the
incoming path. However, some code is called uniformly from both the
incoming and the outgoing path.

A typical example is netfilter. Its hooks are called uniformly all around
the code. The prototype is the following:

static unsigned int reject6_target(struct sk_buff **pskb,
   const struct net_device *in,
   const struct net_device *out,
   unsigned int hooknum,
   const struct xt_target *target,
   const void *targinfo);

So, we are bound to the following options:
- perform additional non-uniform hacks to place 'struct net' into
  more and more structures like xt_target
- add a 7th parameter here and elsewhere
- introduce an skb_net field in 'struct sk_buff', making all code
  uniform, at least where we have an skb
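
What "uniform" could mean for callers can be sketched as a single accessor that goes from an skb to its namespace. This is only an illustration with hypothetical stand-in structures and a hypothetical helper name; it derives the namespace from the two existing sources (device on the incoming path, socket on the outgoing path) rather than from a stored skb_net field.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * needed for the illustration are modelled. */
struct net { int id; };
struct net_device { struct net *nd_net; };
struct sock { struct net *sk_net; };
struct sk_buff {
    struct net_device *dev;    /* set on the incoming path */
    struct sock *sk;           /* set on the outgoing path */
};

/* Return the namespace a packet belongs to, or NULL if neither a
 * device nor a socket is attached to the skb. */
static struct net *skb_net_sketch(const struct sk_buff *skb)
{
    if (skb->dev)
        return skb->dev->nd_net;
    if (skb->sk)
        return skb->sk->sk_net;
    return NULL;
}
```

A hook with the six-parameter prototype above could then call such a helper on *pskb instead of growing a seventh parameter, provided the skb always carries one of the two pointers at that point.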

I think that this is not the last place with such a parameter list, and
we should make a decision at this point, while the code is not in mainline yet.

As far as I understand, netfilter is not touched by Eric's patches, and we
can face some non-trivial problems there.

So, if my point about uniformity is valid, this patchset looks wrong and
should be re-worked :(

Regards,
Den

Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread Denys

Hi all again,

I have a shaper running a very simple HTB tree (about 10 classes).

Traffic comes in via eth0 and goes out over eth0.1000; the shaper is
installed on eth0.1000 (802.1Q VLAN), and the total rate is 85Mbit/s.

On kernel 2.6.22 with bnx2 on eth0 everything was working fine.

I have another server with a similar task, with e1000, and on 2.6.22 it was
shaping to a lower rate than expected: on full load it was 65-66Mbit/s
instead of 88Mbit/s. So I postponed troubleshooting and moved it to a backup
role.

When I upgraded the bnx2 server to 2.6.23 I got the same behaviour on bnx2
too.

Even with extended burst/cburst it is around 82.5Mbit/s instead of 85Mbit/s
(calculated over 5-second intervals):
82319/82354 KBit/S (52684221/52706721) (0/0)
82375/82368 KBit/S (52720558/52716058) (52684221/52706721)
82500/82500 KBit/S (52800425/52800425) (105404779/105422779)
82406/82406 KBit/S (52740119/52740119) (158205204/158223204)
82615/82631 KBit/S (52873964/52884464) (210945323/210963323)
82459/82464 KBit/S (52774379/52777379) (263819287/263847787)
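
The first figure on each line appears to follow from the first byte count in parentheses. A small sketch of that arithmetic, assuming a 5-second window and 1 KBit = 1024 bits (both inferred from the numbers, not stated by the tool; the function name is hypothetical):

```c
#include <stdint.h>

/* Average rate in KBit/s for `bytes` seen over `seconds`,
 * taking 1 KBit as 1024 bits (an assumption inferred from the figures). */
static uint64_t kbit_per_sec(uint64_t bytes, uint64_t seconds)
{
    return bytes * 8 / seconds / 1024;
}
```

On that reading, 52800425 bytes in 5 seconds is 84480680 bit/s, which this conversion reports as 82500 KBit/s.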


Here is the output of tc -s -d class show dev eth0.2022 when I run without
custom burst/cburst (after 60 seconds):

class htb 1:100 root rate 85000Kbit ceil 85000Kbit burst 1583b/8 mpu 0b 
overhead 0b cburst 1583b/8 mpu 0b overhead 0b level 7
 Sent 511732440 bytes 348298 pkt (dropped 0, overlimits 0 requeues 0)
 rate 67637Kbit 5774pps backlog 0b 0p requeues 0
 lended: 4029 borrowed: 0 giants: 0
 tokens: -764 ctokens: -764

class htb 1:200 parent 1:100 leaf 200: prio 0 quantum 2000 rate 5000Kbit ceil 
85000Kbit burst 1600b/8 mpu 0b overhead 0b cburst 1583b/8 mpu 0b overhead 0b 
level 0
 Sent 1785787 bytes 8180 pkt (dropped 0, overlimits 0 requeues 0)
 rate 312608bit 162pps backlog 0b 0p requeues 0
 lended: 8090 borrowed: 90 giants: 0
 tokens: 2426 ctokens: 143

class htb 1:923 parent 1:920 leaf 923: prio 0 quantum 2000 rate 5000Kbit ceil 
85000Kbit burst 1600b/8 mpu 0b overhead 0b cburst 1583b/8 mpu 0b overhead 0b 
level 0
 Sent 16491797 bytes 10896 pkt (dropped 0, overlimits 0 requeues 0)
 rate 2171Kbit 179pps backlog 0b 0p requeues 0
 lended: 10673 borrowed: 223 giants: 0
 tokens: 138 ctokens: 8

class htb 1:910 parent 1:900 leaf 910: prio 0 quantum 2000 rate 5Kbit 
ceil 85000Kbit burst 1600b/8 mpu 0b overhead 0b cburst 1583b/8 mpu 0b 
overhead 0b level 0
 Sent 276577122 bytes 185876 pkt (dropped 7427, overlimits 0 requeues 0)
 rate 36497Kbit 3065pps backlog 0b 444p requeues 0
 lended: 182521 borrowed: 2911 giants: 0
 tokens: -455 ctokens: -117

class htb 1:922 parent 1:920 leaf 922: prio 5 quantum 2000 rate 5000Kbit ceil 
85000Kbit burst 1600b/8 mpu 0b overhead 0b cburst 1583b/8 mpu 0b overhead 0b 
level 0
 Sent 56949991 bytes 37706 pkt (dropped 34663, overlimits 0 requeues 0)
 rate 7298Kbit 604pps backlog 0b 823p requeues 0
 lended: 24844 borrowed: 12039 giants: 0
 tokens: -4641 ctokens: -110

class htb 1:900 parent 1:100 rate 8Kbit ceil 85000Kbit burst 1590b/8 mpu 
0b overhead 0b cburst 1583b/8 mpu 0b overhead 0b level 6
 Sent 509946653 bytes 340118 pkt (dropped 0, overlimits 0 requeues 0)
 rate 67324Kbit 5612pps backlog 0b 0p requeues 0
 lended: 10152 borrowed: 3939 giants: 0
 tokens: -850 ctokens: -764

class htb 1:921 parent 1:920 leaf 921: prio 5 quantum 2000 rate 2Kbit 
ceil 85000Kbit burst 1600b/8 mpu 0b overhead 0b cburst 1583b/8 mpu 0b 
overhead 0b level 0
 Sent 162349726 bytes 107246 pkt (dropped 51305, overlimits 0 requeues 0)
 rate 21252Kbit 1755pps backlog 0b 339p requeues 0
 lended: 94785 borrowed: 12122 giants: 0
 tokens: -1113 ctokens: -239

class htb 1:920 parent 1:900 rate 3Kbit ceil 85000Kbit burst 1593b/8 mpu 
0b overhead 0b cburst 1583b/8 mpu 0b overhead 0b level 5
 Sent 234033699 bytes 154686 pkt (dropped 0, overlimits 0 requeues 0)
 rate 30719Kbit 2538pps backlog 0b 0p requeues 0
 lended: 13204 borrowed: 11180 giants: 0
 tokens: -1780 ctokens: -505



Here is the output of tc -s -d class show dev eth0.2022 when I run WITH
custom burst/cburst (after 60 seconds):

class htb 1:100 root rate 85000Kbit ceil 85000Kbit burst 16Kb/8 mpu 0b 
overhead 0b cburst 8Kb/8 mpu 0b overhead 0b level 7
 Sent 637707898 bytes 430930 pkt (dropped 0, overlimits 0 requeues 0)
 rate 83982Kbit 7082pps backlog 0b 0p requeues 0
 lended: 22809 borrowed: 0 giants: 0
 tokens: -2453 ctokens: -3206

class htb 1:200 parent 1:100 leaf 200: prio 0 quantum 2000 rate 5000Kbit ceil 
85000Kbit burst 16Kb/8 mpu 0b overhead 0b cburst 8Kb/8 mpu 0b overhead 0b 
level 0
 Sent 1266780 bytes 7041 pkt (dropped 0, overlimits 0 requeues 0)
 rate 128920bit 102pps backlog 0b 0p requeues 0
 lended: 6983 borrowed: 58 giants: 0
 tokens: 23250 ctokens: 615

class htb 1:923 parent 1:920 leaf 923: prio 0 quantum 2000 rate 5000Kbit ceil 
85000Kbit burst 16Kb/8 mpu 0b overhead 0b cburst 8Kb/8 mpu 0b overhead 0b 
level 0
 Sent 16484432 bytes 10888 pkt (dropped 0, overlimits 0 requeues 0)
 rate 2172Kbit 179pps backlog 0b 0p requeues 0
 lended: 10888 borrowed: 0 giants: 0
 tokens: 18537 ctokens: 362

class htb 1:910 parent 1:900 leaf 910: prio 0

Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread Denys

P.S.
dmesg
Linux version 2.6.23-insat1 ([EMAIL PROTECTED]) (gcc version 4.1.1 (Gentoo 
4.1.1-
r3)) #1 SMP Wed Oct 10 01:41:17 EEST 2007
BIOS-provided physical RAM map:
 BIOS-e820: 0100 - 000a (usable)
 BIOS-e820: 0010 - 3ffa8000 (usable)
 BIOS-e820: 3ffa8000 - 3ffb7c00 (ACPI data)
 BIOS-e820: 3ffb7c00 - 4000 (reserved)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fe00 - 0001 (reserved)
127MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000fe710
Entering add_active_range(0, 0, 262056) 0 entries of 256 used
Zone PFN ranges:
  DMA 0 -> 4096
  Normal   4096 ->   229376
  HighMem229376 ->   262056
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
0:0 ->   262056
On node 0 totalpages: 262056
  DMA zone: 32 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 4064 pages, LIFO batch:0
  Normal zone: 1760 pages used for memmap
  Normal zone: 223520 pages, LIFO batch:31
  HighMem zone: 255 pages used for memmap
  HighMem zone: 32425 pages, LIFO batch:7
  Movable zone: 0 pages used for memmap
DMI 2.4 present.
Using APIC driver default
ACPI: RSDP 000F2620, 0024 (r2 DELL  )
ACPI: XSDT 000F26A0, 004C (r1 DELL   PE_SC3  1 DELL1)
ACPI: FACP 000F27A8, 00F4 (r3 DELL   PE_SC3  1 DELL1)
ACPI: DSDT 3FFA8000, 3C53 (r1 DELL   PE_SC3  1 MSFT  10E)
ACPI: FACS 3FFB7C00, 0040
ACPI: APIC 000F289C, 00E0 (r1 DELL   PE_SC3  1 DELL1)
ACPI: SPCR 000F297D, 0050 (r1 DELL   PE_SC3  1 DELL1)
ACPI: HPET 000F29CD, 0038 (r1 DELL   PE_SC3  1 DELL1)
ACPI: MCFG 000F2A05, 003C (r1 DELL   PE_SC3  1 DELL1)
ACPI: PM-Timer IO Port: 0x808
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:6 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
Processor #2 15:6 APIC version 20
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 15:6 APIC version 20
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
Processor #3 15:6 APIC version 20
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x14] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x15] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x16] disabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x17] disabled)
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
ACPI: IOAPIC (id[0x04] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 4, version 32, address 0xfec0, GSI 0-23
ACPI: IOAPIC (id[0x05] address[0xfec8] gsi_base[32])
IOAPIC[1]: apic_id 5, version 32, address 0xfec8, GSI 32-55
ACPI: IOAPIC (id[0x06] address[0xfec81000] gsi_base[64])
IOAPIC[2]: apic_id 6, version 32, address 0xfec81000, GSI 64-87
ACPI: IOAPIC (id[0x07] address[0xfec82000] gsi_base[96])
IOAPIC[3]: apic_id 7, version 32, address 0xfec82000, GSI 96-119
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 4 I/O APICs
ACPI: HPET id: 0x8086a201 base: 0xfed0
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 5000 (gap: 4000:a000)
Built 1 zonelists in Zone order.  Total pages: 260009
Kernel command line: root=/dev/sda3 panic=10 nmi_watchdog=1
mapped APIC to b000 (fee0)
mapped IOAPIC to a000 (fec0)
mapped IOAPIC to 9000 (fec8)
mapped IOAPIC to 8000 (fec81000)
mapped IOAPIC to 7000 (fec82000)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c050e000 soft=c04ee000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Detected 3192.148 MHz processor.
Console: colour VGA+ 80x25
console [tty0] enabled
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 1033564k/1048224k available (2453k kernel code, 14048k reserved, 
1305k data, 232k init, 130720k highmem)
virtual kernel memory layout:
fixmap  : 0xffe14000 - 0xf000   (1964 kB)
pkmap   : 0xff80 - 0xffc0   (4096 kB)
vmalloc : 0xf880 - 0xff7fe000   ( 111 MB)
lowmem  : 0xc000 - 0xf800   ( 896 MB)
  .init : 0xc04b1000 - 0xc04eb000   ( 232 kB)
  .data : 0xc036544c - 0xc04ab97c   (1305 kB)
  .text : 0xc010 - 0xc036544c   (2453 kB)
Checking if this processor honours the

Re: Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread Denys

It seems I am a bit lost. Now on 2.6.22, I am not sure it is working well
either. Possibly this is related to the fact that I booted the kernel via kexec.

I will do a full power-cycle reboot if required, but it will cause
serious downtime for me. Please tell me which kernel is preferable to boot.

In case it is of interest, here is 2.6.22 (also non-functional now):

visp-1 ~ # cat /proc/net/psched
03e8 0400 000f4240 3b9aca00

maybe it is related to 
visp-1 ~ # dmesg|grep hpet
hpet0: at MMIO 0xfed0, IRQs 2, 8, 0
hpet0: 3 64-bit timers, 14318180 Hz
Time: hpet clocksource has been installed.
hpet_resources: 0xfed0 is busy <<< - this?



Diff 2.6.22 -> 2.6.23 dmesg

--- log.2.6.22  2007-10-10 16:08:04.0 +0300
+++ log.2.6.23  2007-10-10 16:06:19.0 +0300
@@ -1,4 +1,4 @@
-Linux version 2.6.22-gentoo-r5-insat1 ([EMAIL PROTECTED]) (gcc version 4.1.1 
(Gentoo 4.1.1-r3)) #4 SMP Tue Sep 4 14:32:32 EEST 2007
+Linux version 2.6.23-insat1 ([EMAIL PROTECTED]) (gcc version 4.1.1 (Gentoo 
4.1.1-
r3)) #1 SMP Wed Oct 10 01:41:17 EEST 2007
 BIOS-provided physical RAM map:
  BIOS-e820: 0100 - 000a (usable)
  BIOS-e820: 0010 - 3ffa8000 (usable)
@@ -14,6 +14,7 @@
   DMA 0 -> 4096
   Normal   4096 ->   229376
   HighMem229376 ->   262056
+Movable zone start PFN for each node
 early_node_map[1] active PFN ranges
 0:0 ->   262056
 On node 0 totalpages: 262056
@@ -24,6 +25,7 @@
   Normal zone: 223520 pages, LIFO batch:31
   HighMem zone: 255 pages used for memmap
   HighMem zone: 32425 pages, LIFO batch:7
+  Movable zone: 0 pages used for memmap
 DMI 2.4 present.
 Using APIC driver default
 ACPI: RSDP 000F2620, 0024 (r2 DELL  )
@@ -74,45 +76,46 @@
 ACPI: HPET id: 0x8086a201 base: 0xfed0
 Using ACPI (MADT) for SMP configuration information
 Allocating PCI resources starting at 5000 (gap: 4000:a000)
-Built 1 zonelists.  Total pages: 260009
+Built 1 zonelists in Zone order.  Total pages: 260009
 Kernel command line: root=/dev/sda3 panic=10 nmi_watchdog=1
-mapped APIC to d000 (fee0)
-mapped IOAPIC to c000 (fec0)
-mapped IOAPIC to b000 (fec8)
-mapped IOAPIC to a000 (fec81000)
-mapped IOAPIC to 9000 (fec82000)
+mapped APIC to b000 (fee0)
+mapped IOAPIC to a000 (fec0)
+mapped IOAPIC to 9000 (fec8)
+mapped IOAPIC to 8000 (fec81000)
+mapped IOAPIC to 7000 (fec82000)
 Enabling fast FPU save and restore... done.
 Enabling unmasked SIMD FPU exception support... done.
 Initializing CPU#0
-CPU 0 irqstacks, hard=c0508000 soft=c04e8000
+CPU 0 irqstacks, hard=c050e000 soft=c04ee000
 PID hash table entries: 4096 (order: 12, 16384 bytes)
-Detected 3192.172 MHz processor.
+Detected 3192.148 MHz processor.
 Console: colour VGA+ 80x25
+console [tty0] enabled
 Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
 Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
-Memory: 1033592k/1048224k available (2456k kernel code, 14036k reserved, 
1265k data, 236k init, 130720k highmem)
+Memory: 1033564k/1048224k available (2453k kernel code, 14048k reserved, 
1305k data, 232k init, 130720k highmem)
 virtual kernel memory layout:
-fixmap  : 0xffe16000 - 0xf000   (1956 kB)
+fixmap  : 0xffe14000 - 0xf000   (1964 kB)
 pkmap   : 0xff80 - 0xffc0   (4096 kB)
 vmalloc : 0xf880 - 0xff7fe000   ( 111 MB)
 lowmem  : 0xc000 - 0xf800   ( 896 MB)
-  .init : 0xc04a8000 - 0xc04e3000   ( 236 kB)
-  .data : 0xc03660d1 - 0xc04a289c   (1265 kB)
-  .text : 0xc010 - 0xc03660d1   (2456 kB)
+  .init : 0xc04b1000 - 0xc04eb000   ( 232 kB)
+  .data : 0xc036544c - 0xc04ab97c   (1305 kB)
+  .text : 0xc010 - 0xc036544c   (2453 kB)
 Checking if this processor honours the WP bit even in supervisor mode... Ok.
 SLUB: Genslabs=22, HWalign=64, Order=0-1, MinObjects=4, CPUs=4, Nodes=1
 hpet0: at MMIO 0xfed0, IRQs 2, 8, 0
 hpet0: 3 64-bit timers, 14318180 Hz
-Calibrating delay using timer specific routine.. 6388.07 BogoMIPS 
(lpj=3194038)
+Calibrating delay using timer specific routine.. 6388.11 BogoMIPS 
(lpj=3194056)
 Mount-cache hash table entries: 512
-CPU: After generic identify, caps: bfebfbff 2010   
e43d  0001
+CPU: After generic identify, caps: bfebfbff 2010   
e43d  0001 
 monitor/mwait feature present.
 using mwait in idle threads.
 CPU: Trace cache: 12K uops, L1 D cache: 16K
 CPU: L2 cache: 2048K
 CPU: Physical Processor ID: 0
 CPU: Processor Core ID: 0
-CPU: After all inits, caps: bfebfbff 2010  b180 e43d 
 0001
+CPU: After all inits, caps: bfebfbff 2010  b180 e43d 
 0001 
 Intel machine check architecture supported.
 Intel machine check reporting enabled on CPU#0.
 CPU0: Intel P4/Xeon Extended MCE MSRs (24) available
@@ -123,114 +126,81 @@
 ACPI: Core revision 20070126
 Parsing

Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread jamal

On Wed, 2007-10-10 at 03:44 -0700, David Miller wrote:

> I've always gotten very poor results when increasing the TX queue a
> lot, for example with NIU the point of diminishing returns seems to
> be in the range of 256-512 TX descriptor entries and this was with
> 1.6Ghz cpus.

Is it an interrupt per packet? From my experience, you may find interesting
results by varying the tx interrupt mitigation parameters in addition to the
ring parameters.
Unfortunately, when you do that, the optimal parameters also depend on
packet size, so what works for 64B won't work well for 1400B.

cheers,
jamal


Re: [re] Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread jamal

On Wed, 2007-10-10 at 14:45 +0200, Patrick McHardy wrote:

> OK, hrtimers are disabled on your system, but we still announce
> the usec clock resolution to userspace, which is used by HTB to
> calculate the burst rate. But actually that can't be the reason
> since that has already been the case in 2.6.22. Please post a diff
> of the bootlog from 2.6.22 and 2.6.23.

Any possible relation to the clock source? The logs seem to indicate the acpi
source; how do tsc or jiffies do?

BTW, I could be wrong about this, but IIRC on a Xeon I had access to I
saw that I could not guarantee the same clock source would be selected
across reboots around 2.6.22.

cheers,
jamal


Re: Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread Patrick McHardy

Denys wrote:
> Seems i am lost a bit. Now 2.6.22, i am not sure that working well. Possible 
> it is related, that i booted kernel over kexec.


Possibly.

> I will try to do full power cycle reboot if required, but it will cause for 
> me serious downtime. Please tell me, which kernel prefferable to boot?
> 
> If it is interesting 
> 2.6.22 (but also non-functional now).
> 
> visp-1 ~ # cat /proc/net/psched
> 03e8 0400 000f4240 3b9aca00
> 
> maybe it is related to 
> visp-1 ~ # dmesg|grep hpet
> hpet0: at MMIO 0xfed0, IRQs 2, 8, 0
> hpet0: 3 64-bit timers, 14318180 Hz
> Time: hpet clocksource has been installed.
> hpet_resources: 0xfed0 is busy <<< - this?


That appears on both 2.6.22 and 2.6.23.

> --- log.2.6.22  2007-10-10 16:08:04.0 +0300
> +++ log.2.6.23  2007-10-10 16:06:19.0 +0300

> @@ -314,12 +235,20 @@
>  usbcore: registered new device driver usb
>  PCI: Using ACPI for IRQ routing
>  PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a 
> report
> +Time: hpet clocksource has been installed.
> +Clockevents: could not switch to one-shot mode: lapic is not functional.
> +Could not switch to high resolution mode on CPU 0
> +Clockevents: could not switch to one-shot mode:<6>Clockevents: could not 
> switch to one-shot mode: lapic is not functional.
> + lapic is not functional.
> +Could not switch to high resolution mode on CPU 2
> +Could not switch to high resolution mode on CPU 3
> +Clockevents: could not switch to one-shot mode: lapic is not functional.
> +Could not switch to high resolution mode on CPU 1
>  pnp: 00:08: ioport range 0x800-0x87f has been reserved
>  pnp: 00:08: ioport range 0x880-0x8bf has been reserved
>  pnp: 00:08: ioport range 0x8c0-0x8df has been reserved
>  pnp: 00:08: ioport range 0x8e0-0x8e3 has been reserved
>  pnp: 00:08: ioport range 0xc00-0xc7f has been reserved
> -Time: hpet clocksource has been installed.
>  pnp: 00:08: ioport range 0xca0-0xca7 has been reserved
>  pnp: 00:08: ioport range 0xca9-0xcab has been reserved
>  pnp: 00:08: ioport range 0xcad-0xcaf has been reserved


>  Real Time Clock Driver v1.12ac
> -[ACPI Debug]  String: [0x09] "HPET _CRS"
> -[ACPI Debug]  Buffer: [0x1C]
>  hpet_resources: 0xfed0 is busy
> -ACPI Error (utglobal-0126): Unknown exception code: 0xFFF0 [20070126]
>  intel_rng: FWH not detected
>  Hangcheck: starting hangcheck timer 0.9.0 (tick is 180 seconds, margin is 60 
> seconds).
>  Hangcheck: Using get_cycles().
> -input: Power Button (FF) as /class/input/input0
> -ACPI: Power Button (FF) [PWRF]
>  Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
> +serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> +serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
>  00:06: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
>  00:07: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
> +Clockevents: could not switch to one-shot mode:<6>Clockevents: could not 
> switch to one-shot mode:<6>Clockevents: could not switch to one-shot mode: 
> lapic is not functional.
> + lapic is not functional.
> +Could not switch to high resolution mode on CPU 3
> +Could not switch to high resolution mode on CPU 2
> +Clockevents: could not switch to one-shot mode: lapic is not functional.
> +Could not switch to high resolution mode on CPU 0
> + lapic is not functional.
> +Could not switch to high resolution mode on CPU 1


hrtimers seem to have worked on your system in 2.6.22 and not in
2.6.23 anymore. This patch should fix the incorrectly announced
/proc/net/psched timer resolution I mentioned earlier, causing HTB
to use larger burst rates by default, but that still won't be as
precise as with hrtimers.
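
To see why the announced resolution matters for the default burst, here is a rough sketch of the arithmetic. It assumes that a token bucket must hold at least the bytes that accumulate at the configured rate during one timer period; the function name and the 1000 Hz figure used below are hypothetical illustrations, not values from this system.

```c
#include <stdint.h>

/* Bytes that accumulate at rate_bps (bits per second) during one
 * period of a timer firing timer_hz times per second. */
static uint64_t bytes_per_timer_period(uint64_t rate_bps, uint64_t timer_hz)
{
    return rate_bps / 8 / timer_hz;
}
```

At 85000Kbit and a hypothetical 1000 Hz timer this gives 10625 bytes per period, well above the 1583b default burst shown earlier in the thread; with the announced 1 ns resolution (1000000000 Hz) the same calculation rounds down to 0.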

Looking at the code, the reason for not using the lapic seems to
be nmi_watchdog=1:

+APIC timer registered as dummy, due to nmi_watchdog=1!

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index dee0d5f..8f1bcf6 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1225,10 +1225,13 @@ EXPORT_SYMBOL(tcf_destroy_chain);
 #ifdef CONFIG_PROC_FS
 static int psched_show(struct seq_file *seq, void *v)
 {
+	struct timespec ts;
+
+	hrtimer_get_res(CLOCK_MONOTONIC, &ts);
 	seq_printf(seq, "%08x %08x %08x %08x\n",
 		   (u32)NSEC_PER_USEC, (u32)PSCHED_US2NS(1),
 		   1000000,
-		   (u32)NSEC_PER_SEC/(u32)ktime_to_ns(KTIME_MONOTONIC_RES));
+		   (u32)NSEC_PER_SEC/(u32)ktime_to_ns(timespec_to_ktime(ts)));
 
 	return 0;
 }

Re: Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread Denys

I did a complete reboot (without kexec) to 2.6.23 (same configuration), and it
seems to be working better (not stuck at 60-70Mbit/s as before).

In all cases current_clocksource was hpet; I just tried to change it (it
doesn't help at all).

visp-1 ~ # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
visp-1 ~ # cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm jiffies tsc

From the pcap analyser I wrote (it just filters by the expression "ip" and
counts bytes on eth0.1000):

82957/82957 KBit/S (53092530/53092530) (3887135786/3889389904)
82931/82931 KBit/S (53076469/53076469) (3940228316/3942482434)
82965/82965 KBit/S (53097615/53097615) (3993304785/3995558903)
82946/82946 KBit/S (53085988/53085988) (4046402400/4048656518)
82867/82867 KBit/S (53035341/53035341) (4099488388/4101742506)
82941/82941 KBit/S (53082260/53082260) (4152523729/4154777847)
82952/82952 KBit/S (53089348/53089348) (4205605989/4207860107)
82948/82945 KBit/S (53086915/53085415) (4258695337/4260949455)


visp-1 ~ # cat /proc/net/psched
03e8 0400 000f4240 3b9aca00
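
For readers following along, the four hex words can be decoded with a few lines of userspace C. This is only a minimal sketch, not anything from the kernel or iproute2: the struct and function names are made up for illustration, and the field meanings are taken from the seq_printf() call in the sch_api.c patch in this thread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the four fields; the names are illustrative. */
struct psched_info {
    unsigned int ns_per_usec;   /* field 1: NSEC_PER_USEC */
    unsigned int ns_per_tick;   /* field 2: PSCHED_US2NS(1) */
    unsigned int constant;      /* field 3: fixed value printed by the kernel */
    unsigned int res_hz;        /* field 4: NSEC_PER_SEC / resolution in ns */
};

/* Parse one line of /proc/net/psched; returns 0 on success, -1 on error. */
int psched_parse(const char *line, struct psched_info *out)
{
    if (sscanf(line, "%x %x %x %x", &out->ns_per_usec, &out->ns_per_tick,
               &out->constant, &out->res_hz) != 4)
        return -1;
    return 0;
}

/* Resolution in nanoseconds implied by the fourth field (0 if it is 0). */
unsigned int psched_res_ns(const struct psched_info *p)
{
    return p->res_hz ? 1000000000u / p->res_hz : 0;
}
```

For the line above this gives 1000, 1024, 1000000 and 1000000000, which corresponds to an advertised resolution of 1 ns.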

How can I help further?

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


Re: [re] Possible 2.6.22 -> 2.6.23 HTB regression?

2007-10-10 Thread Denys

Patch applied. Rebooted via kexec into 2.6.23 without nmi_watchdog; for now 
everything seems fine.

visp-1 ~ # cat /proc/net/psched
03e8 0400 000f4240 3b9aca00

82942/82949 KBit/S (53083445/53087945) (106109614/106105114)
82955/82955 KBit/S (53091631/53091631) (159193059/159193059)
82955/82955 KBit/S (53091351/53091351) (212284690/212284690)
82951/82951 KBit/S (53088902/53088902) (265376041/265376041)
82940/82940 KBit/S (53081605/53081605) (318464943/318464943)
82959/82959 KBit/S (53094269/53094269) (371546548/371546548)
81596/81596 KBit/S (52221918/52221918) (424640817/424640817)
82909/82909 KBit/S (53062055/53062055) (476862735/476862735)
82939/82939 KBit/S (53081402/53081402) (529924790/529924790)
82963/82963 KBit/S (53096554/53096554) (583006192/583006192)
82954/82954 KBit/S (53090871/53090871) (636102746/636102746)
82030/82943 KBit/S (52499816/53084066) (689193617/689193617)
82945/82945 KBit/S (53085182/53085182) (741693433/742277683)
82964/82954 KBit/S (53097002/53091002) (794778615/795362865)



--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


Re: [Devel] [PATCH 1/5] net: Modify all rtnetlink methods to only work in the initial namespace

2007-10-10 Thread Daniel Lezcano


Denis V. Lunev wrote:

Eric W. Biederman wrote:

Before I can enable rtnetlink to work in all network namespaces
I need to be certain that something won't break.  So this
patch deliberately disables all of the rtnetlink methods in everything
except the initial network namespace.  After the methods have been
audited this extra check can be disabled.


[...]

 static int br_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   struct net *net = skb->sk->sk_net;
struct net_device *dev;
int idx;
 


I read some code today while grepping for 'init_net.loopback_dev' and found
an issue that is interesting and, at least for me, non-trivial.

The network namespace is extracted from the packet in two different ways in
TCP: from the socket on the outgoing path and from the device on the incoming
path. However, some code is called uniformly from both the incoming and the
outgoing path.

A typical example is netfilter. Its hooks are called uniformly all around the
code. The prototype is the following:

static unsigned int reject6_target(struct sk_buff **pskb,
   const struct net_device *in,
   const struct net_device *out,
   unsigned int hooknum,
   const struct xt_target *target,
   const void *targinfo);



Thanks Denis for auditing the code.

As far as I can see, struct net_device *in is NULL for outgoing traffic and 
struct net_device *out is NULL for ingress traffic, except for the 
FORWARD rules, where both are filled in. If we follow network 
namespace semantics, we should never have two network devices belonging to 
two different namespaces, right?
In that case, the following line of code should be sufficient to 
retrieve the network namespace, no?


struct net *net = in?in->nd_net:out->nd_net;
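
A small userspace sketch of that lookup, for illustration only. The two structures are simplified stand-ins for the kernel ones, and hook_net is a hypothetical name. Unlike the one-liner, it returns NULL when both devices are missing; whether that case can occur in practice is not settled in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the fields used here. */
struct net { int id; };
struct net_device { struct net *nd_net; };

/*
 * Pick the namespace for one hook invocation: prefer the input device,
 * fall back to the output device, and return NULL if neither is set.
 */
struct net *hook_net(const struct net_device *in, const struct net_device *out)
{
    if (in)
        return in->nd_net;
    if (out)
        return out->nd_net;
    return NULL;
}
```

With FORWARD both devices are set and, under the assumption stated above, belong to the same namespace, so picking either one gives the same answer.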


So, we are limited to the following options:
- perform additional non-uniform hacks to place 'struct net' into
  more and more structures such as xt_target
- add a 7th parameter here and elsewhere
- introduce an skb_net field in 'struct sk_buff', making all code
  uniform, at least when we have an skb

I think that this is not the last place with such a parameter list and
we should make a decision now, while the code is not in mainline yet.

As far as I understand, netfilter has not been touched by Eric and we
can face some non-trivial problems there.


In Eric's git tree:
http://git.kernel.org/?p=linux/kernel/git/ebiederm/linux-2.6-netns.git

There are some modifications concerning 
net/ipv4/netfilter/iptable_filter.c, and in the ipt_hook function there is:


struct net *net = (in?in:out)->nd_net;


So, if my point about uniformity is valid, this patchset looks wrong and
should be re-worked :(


As Eric said, we want to build the network namespace step by step, 
taking care not to break the init network namespace.


If you want to make iptables per-namespace or catch problems before the 
code goes to Dave's tree, IMHO it would be more convenient to post patches 
against netns49 to containers@, where the modifications can be seen in the 
network namespace big picture.


Regards.

  -- Daniel


[PATCH] ip_local_port_range low > high check

2007-10-10 Thread Denis V. Lunev

This patch adds a check to ip_local_port_range that rejects writes where low > high.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 53ef0f4..686c0a4 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -186,6 +186,61 @@ static int strategy_allowed_congestion_control(ctl_table *table, int __user *nam
 
 }
 
+static int proc_port_range(ctl_table *table, int write, struct file *filp,
+   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret;
+   ctl_table tbl = {
+   .maxlen = sizeof(sysctl_local_port_range),
+   .extra1 = ip_local_port_range_min,
+   .extra2 = ip_local_port_range_max
+   };
+   tbl.data = kmalloc(tbl.maxlen, GFP_USER);
+   if (tbl.data == NULL)
+   return -ENOMEM;
+   memcpy(tbl.data, sysctl_local_port_range, tbl.maxlen);
+
+   ret = proc_dointvec_minmax(&tbl, write, filp, buffer, lenp, ppos);
+   if (write && ret == 0) {
+   int *data = (int *)tbl.data;
+   if (data[0] > data[1])
+   ret = -EINVAL;
+   else
+   memcpy(sysctl_local_port_range, data,
+   sizeof(sysctl_local_port_range));
+   }
+   kfree(tbl.data);
+   return ret;
+}
+
+int sysctl_strategy_port_range(ctl_table *table, int __user *name, int nlen,
+   void __user *oldval, size_t __user *oldlenp,
+   void __user *newval, size_t newlen)
+{
+   int ret;
+   ctl_table tbl = {
+   .maxlen = sizeof(sysctl_local_port_range),
+   .extra1 = ip_local_port_range_min,
+   .extra2 = ip_local_port_range_max
+   };
+   tbl.data = kmalloc(tbl.maxlen, GFP_USER);
+   if (tbl.data == NULL)
+   return -ENOMEM;
+   memcpy(tbl.data, sysctl_local_port_range, tbl.maxlen);
+
+   ret = sysctl_intvec(&tbl, name, nlen, oldval, oldlenp, newval, newlen);
+   if (ret == 0 && newval && newlen) {
+   int *data = (int *)tbl.data;
+   if (data[0] > data[1])
+   ret = -EINVAL;
+   else
+   memcpy(sysctl_local_port_range, data,
+   sizeof(sysctl_local_port_range));
+   }
+   kfree(tbl.data);
+   return ret;
+}
+
 ctl_table ipv4_table[] = {
{
.ctl_name   = NET_IPV4_TCP_TIMESTAMPS,
@@ -427,8 +482,8 @@ ctl_table ipv4_table[] = {
.data   = &sysctl_local_port_range,
.maxlen = sizeof(sysctl_local_port_range),
.mode   = 0644,
-   .proc_handler   = &proc_dointvec_minmax,
-   .strategy   = &sysctl_intvec,
+   .proc_handler   = &proc_port_range,
+   .strategy   = &sysctl_strategy_port_range,
.extra1 = ip_local_port_range_min,
.extra2 = ip_local_port_range_max
},
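
The idea behind both handlers, validating a candidate and committing it only when low <= high, can be sketched in plain userspace C. This is an illustration under assumptions rather than the kernel code: local_port_range stands in for sysctl_local_port_range, and port_range_update is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Stand-in for sysctl_local_port_range, using the default seen in this thread. */
int local_port_range[2] = { 32768, 61000 };

/*
 * Check a candidate range against the per-element limits and the
 * low <= high rule, and only then commit it. Returns 0 or -EINVAL;
 * on failure the live range is left untouched.
 */
int port_range_update(const int candidate[2], int min, int max)
{
    int i;

    for (i = 0; i < 2; i++)
        if (candidate[i] < min || candidate[i] > max)
            return -EINVAL;
    if (candidate[0] > candidate[1])
        return -EINVAL;
    memcpy(local_port_range, candidate, sizeof(local_port_range));
    return 0;
}
```

A write of "32768 32" is refused and the previous range stays in effect, while a range with equal low and high values is still accepted.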

[PATCH] do not give access to 1-1024 ports for autobinding

2007-10-10 Thread Denis V. Lunev

This patch prevents ports in the 1-1024 range from being configured for autobinding.
A minimum of {1, 1} may only make some sense for people doing deeply embedded work.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>

--- ./net/ipv4/sysctl_net_ipv4.c.port2  2007-10-10 17:46:48.0 +0400
+++ ./net/ipv4/sysctl_net_ipv4.c  2007-10-10 18:08:00.0 +0400
@@ -25,7 +25,7 @@ extern int sysctl_ip_nonlocal_bind;
 #ifdef CONFIG_SYSCTL
 static int zero;
 static int tcp_retr1_max = 255;
-static int ip_local_port_range_min[] = { 1, 1 };
+static int ip_local_port_range_min[] = { 1024, 1024 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
 #endif
 

[NET_SCHED]: Show timer resolution instead of clock resolution in /proc/net/psched

2007-10-10 Thread Patrick McHardy


Fix incorrect HTB burst rate calculation in userspace when
clock and timer resolution differ. I guess this should go
in stable 2.6.22/23 as well.

[NET_SCHED]: Show timer resolution instead of clock resolution in /proc/net/psched

The fourth parameter of /proc/net/psched is supposed to show the timer resolution
and is used by HTB userspace to calculate the necessary burst rate. Currently
we show the clock resolution, which results in a burst rate that is too low when
the two differ.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit a3885788169f2f70634f8142344e5131ccf32595
tree 62bcf28c9706547228521dc4402ebea273326331
parent 0e52ab8ceb41df2104279938484267ab474286d1
author Patrick McHardy <[EMAIL PROTECTED]> Wed, 10 Oct 2007 16:29:14 +0200
committer Patrick McHardy <[EMAIL PROTECTED]> Wed, 10 Oct 2007 16:29:14 +0200

 net/sched/sch_api.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index dee0d5f..8f1bcf6 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1225,10 +1225,13 @@ EXPORT_SYMBOL(tcf_destroy_chain);
 #ifdef CONFIG_PROC_FS
 static int psched_show(struct seq_file *seq, void *v)
 {
+   struct timespec ts;
+
+   hrtimer_get_res(CLOCK_MONOTONIC, &ts);
seq_printf(seq, "%08x %08x %08x %08x\n",
   (u32)NSEC_PER_USEC, (u32)PSCHED_US2NS(1),
   1000000,
-  (u32)NSEC_PER_SEC/(u32)ktime_to_ns(KTIME_MONOTONIC_RES));
+  (u32)NSEC_PER_SEC/(u32)ktime_to_ns(timespec_to_ktime(ts)));
 
return 0;
 }
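
To illustrate why the advertised frequency matters, here is a minimal sketch of a default burst calculation. It assumes that userspace derives the burst roughly as rate / hz + mtu when none is given; that formula is an assumption about the tc side and is not stated in this thread, and default_burst is a hypothetical name.

```c
#include <assert.h>

/*
 * Hypothetical helper: default burst in bytes for a class shaped to
 * rate_bytes_per_sec, given the frequency hz advertised in the fourth
 * field of /proc/net/psched. Assumed formula: rate / hz + mtu.
 */
unsigned int default_burst(unsigned int rate_bytes_per_sec, unsigned int hz,
                           unsigned int mtu)
{
    if (hz == 0)
        return mtu;    /* avoid dividing by zero on a bogus value */
    return rate_bytes_per_sec / hz + mtu;
}
```

Under that assumption, a 10 MB/s class with a 1600-byte MTU gets a burst of only 1600 bytes when the file advertises 1000000000 (1 ns clock resolution), but 11600 bytes when it advertises a 1000 Hz timer.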

Re: [PATCH] Evict tmp variable from the stack in ip6_evictor

2007-10-10 Thread Patrick McHardy


Pavel Emelyanov wrote:

The list_head *tmp is used to help getting the first entry in
the ip6_frag_lru_list list. There is a simpler way to do it



The exact same code exists in ip_fragment.c and nf_conntrack_reasm.c,
please also change it there.

[0/7] IPsec: More input/output clean-ups

2007-10-10 Thread Herbert Xu

Hi Dave:

Here are a few more clean-ups on the IPsec input/output path.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH 1/7] [IPSEC] esp: Remove NAT-T checksum invalidation for BEET

2007-10-10 Thread Herbert Xu

[IPSEC] esp: Remove NAT-T checksum invalidation for BEET

I pointed this out back when this patch was first proposed but it looks like
it got lost along the way.

The checksum only needs to be ignored for NAT-T in transport mode where
we lose the original inner addresses due to NAT.  With BEET the inner
addresses will be intact so the checksum remains valid.

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
---

 net/ipv4/esp4.c |3 +--
 1 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 452910d..1af332d 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -261,8 +261,7 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
 *as per draft-ietf-ipsec-udp-encaps-06,
 *section 3.1.2
 */
-   if (x->props.mode == XFRM_MODE_TRANSPORT ||
-   x->props.mode == XFRM_MODE_BEET)
+   if (x->props.mode == XFRM_MODE_TRANSPORT)
skb->ip_summed = CHECKSUM_UNNECESSARY;
}
 

[PATCH 2/7] [IPSEC] beet: Fix extension header support on output

2007-10-10 Thread Herbert Xu

[IPSEC] beet: Fix extension header support on output

The beet output function completely kills any extension headers by replacing
them with the IPv6 header.  This is because it essentially ignores the
result of ip6_find_1stfragopt by simply acting as if there aren't any
extension headers.

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
---

 net/ipv6/xfrm6_mode_beet.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_mode_beet.c b/net/ipv6/xfrm6_mode_beet.c
index 65e6b2a..d9366df 100644
--- a/net/ipv6/xfrm6_mode_beet.c
+++ b/net/ipv6/xfrm6_mode_beet.c
@@ -44,9 +44,9 @@ static int xfrm6_beet_output(struct xfrm_state *x, struct sk_buff *skb)
hdr_len = ip6_find_1stfragopt(skb, &prevhdr);
memmove(skb->data, iph, hdr_len);
 
-   skb_set_mac_header(skb, offsetof(struct ipv6hdr, nexthdr));
+   skb_set_mac_header(skb, (prevhdr - x->props.header_len) - skb->data);
skb_reset_network_header(skb);
-   skb_set_transport_header(skb, sizeof(struct ipv6hdr));
+   skb_set_transport_header(skb, hdr_len);
top_iph = ipv6_hdr(skb);
 
ipv6_addr_copy(&top_iph->saddr, (struct in6_addr *)&x->props.saddr);

[PATCH 3/7] [IPSEC]: Set skb->data to payload in x->mode->output

2007-10-10 Thread Herbert Xu

[IPSEC]: Set skb->data to payload in x->mode->output

This patch changes the calling convention so that on entry from
x->mode->output and before entry into x->type->output skb->data
will point to the payload instead of the IP header.

This is essentially a redistribution of skb_push/skb_pull calls
with the aim of minimising them on the common path of tunnel +
ESP.

It'll also let us use the same calling convention between IPv4
and IPv6 with the next patch.
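
The pointer bookkeeping behind this convention can be illustrated with a toy model. This is a sketch under assumptions, not kernel code: struct toy_skb and the toy_* helpers are hypothetical stand-ins that track offsets only and omit the bounds checks of the real helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an sk_buff: offsets into one flat buffer, nothing else. */
struct toy_skb {
    size_t data;            /* offset of the current data pointer */
    size_t len;             /* bytes from data to the end of the packet */
    size_t network_header;  /* offset of the IP header */
};

/* Extend the data area towards the head by n bytes. */
void toy_push(struct toy_skb *skb, size_t n)
{
    skb->data -= n;
    skb->len += n;
}

/* Shrink the data area from the head by n bytes. */
void toy_pull(struct toy_skb *skb, size_t n)
{
    skb->data += n;
    skb->len -= n;
}

/* Signed distance from data to the network header (negative once pulled past it). */
long toy_network_offset(const struct toy_skb *skb)
{
    return (long)skb->network_header - (long)skb->data;
}
```

Once the mode output has pulled data past the IP header, the network offset is negative, so pushing by its negation (the skb_push(skb, -skb_network_offset(skb)) idiom in the hunks of this patch) moves data back to the IP header.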

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
---

 net/ipv4/ah4.c  |1 +
 net/ipv4/esp4.c |6 ++
 net/ipv4/ipcomp.c   |1 +
 net/ipv4/xfrm4_mode_beet.c  |5 +++--
 net/ipv4/xfrm4_mode_transport.c |4 ++--
 net/ipv4/xfrm4_mode_tunnel.c|3 +--
 net/ipv4/xfrm4_tunnel.c |1 +
 net/ipv6/ah6.c  |1 +
 net/ipv6/esp6.c |9 ++---
 net/ipv6/ipcomp6.c  |5 -
 net/ipv6/mip6.c |2 ++
 net/ipv6/xfrm6_mode_beet.c  |   13 +++--
 net/ipv6/xfrm6_mode_ro.c|   12 ++--
 net/ipv6/xfrm6_mode_transport.c |   12 ++--
 net/ipv6/xfrm6_mode_tunnel.c|   13 +++--
 net/ipv6/xfrm6_tunnel.c |1 +
 16 files changed, 47 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 3513149..dbb1f11 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -66,6 +66,7 @@ static int ah_output(struct xfrm_state *x, struct sk_buff *skb)
charbuf[60];
} tmp_iph;
 
+   skb_push(skb, -skb_network_offset(skb));
top_iph = ip_hdr(skb);
iph = &tmp_iph.iph;
 
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 1af332d..0f5e838 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -28,9 +28,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
int alen;
int nfrags;
 
-   /* Strip IP+ESP header. */
-   __skb_pull(skb, skb_transport_offset(skb));
-   /* Now skb is pure payload to encrypt */
+   /* skb is pure payload to encrypt */
 
err = -ENOMEM;
 
@@ -60,7 +58,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
tail[clen - skb->len - 2] = (clen - skb->len) - 2;
pskb_put(skb, trailer, clen - skb->len);
 
-   __skb_push(skb, -skb_network_offset(skb));
+   skb_push(skb, -skb_network_offset(skb));
top_iph = ip_hdr(skb);
esph = (struct ip_esp_hdr *)(skb_network_header(skb) +
 top_iph->ihl * 4);
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index e787044..1929d45 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -134,6 +134,7 @@ static int ipcomp_output(struct xfrm_state *x, struct sk_buff *skb)
int hdr_len = 0;
struct iphdr *iph = ip_hdr(skb);
 
+   skb_push(skb, -skb_network_offset(skb));
iph->tot_len = htons(skb->len);
hdr_len = iph->ihl * 4;
if ((skb->len - hdr_len) < ipcd->threshold) {
diff --git a/net/ipv4/xfrm4_mode_beet.c b/net/ipv4/xfrm4_mode_beet.c
index a73e710..77888f5 100644
--- a/net/ipv4/xfrm4_mode_beet.c
+++ b/net/ipv4/xfrm4_mode_beet.c
@@ -40,10 +40,11 @@ static int xfrm4_beet_output(struct xfrm_state *x, struct sk_buff *skb)
if (unlikely(optlen))
hdrlen += IPV4_BEET_PHMAXLEN - (optlen & 4);
 
-   skb_push(skb, x->props.header_len - IPV4_BEET_PHMAXLEN + hdrlen);
-   skb_reset_network_header(skb);
+   skb_set_network_header(skb, IPV4_BEET_PHMAXLEN - x->props.header_len -
+   hdrlen);
top_iph = ip_hdr(skb);
skb->transport_header += sizeof(*iph) - hdrlen;
+   __skb_pull(skb, sizeof(*iph) - hdrlen);
 
memmove(top_iph, iph, sizeof(*iph));
if (unlikely(optlen)) {
diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c
index 6010471..10499d2 100644
--- a/net/ipv4/xfrm4_mode_transport.c
+++ b/net/ipv4/xfrm4_mode_transport.c
@@ -27,8 +27,8 @@ static int xfrm4_transport_output(struct xfrm_state *x, struct sk_buff *skb)
int ihl = iph->ihl * 4;
 
skb->transport_header = skb->network_header + ihl;
-   skb_push(skb, x->props.header_len);
-   skb_reset_network_header(skb);
+   skb_set_network_header(skb, -x->props.header_len);
+   __skb_pull(skb, ihl);
memmove(skb_network_header(skb), iph, ihl);
return 0;
 }
diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
index 9963700..bac1a91 100644
--- a/net/ipv4/xfrm4_mode_tunnel.c
+++ b/net/ipv4/xfrm4_mode_tunnel.c
@@ -49,8 +49,7 @@ static int xfrm4_tunnel_output(struct xfrm_state *x, struct sk_buff *skb)
iph = ip_hdr(skb);
skb->transport_header = skb->network_header;
 
-   skb_push(skb, x->props.header_len);
-   skb_reset_network_header(skb);
+   skb_set_network_header(skb, -x->props.header_len);
top_iph = ip_hdr(skb);
 
top_i

[PATCH 5/7] [IPSEC]: Get rid of ipv6_{auth,esp,comp}_hdr

2007-10-10 Thread Herbert Xu

[IPSEC]: Get rid of ipv6_{auth,esp,comp}_hdr

This patch removes the duplicate ipv6_{auth,esp,comp}_hdr structures since
they're identical to the IPv4 versions.  Duplicating them would only create
problems for ourselves later when we need to add things like extended
sequence numbers.

I've also added transport header type conversion headers for these types
which are now used by the transforms.

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
---

 include/linux/ipv6.h |   21 -
 include/net/ah.h |7 +++
 include/net/esp.h|7 +++
 include/net/ipcomp.h |   11 ++-
 net/ipv4/ah4.c   |   18 +-
 net/ipv4/esp4.c  |   10 +-
 net/ipv4/ipcomp.c|2 +-
 net/ipv6/ah6.c   |   16 
 net/ipv6/esp6.c  |   18 +-
 net/ipv6/ipcomp6.c   |   17 -
 10 files changed, 64 insertions(+), 63 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 4ca60c3..5d35a4c 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -96,27 +96,6 @@ struct ipv6_destopt_hao {
struct in6_addr addr;
 } __attribute__ ((__packed__));
 
-struct ipv6_auth_hdr {
-   __u8  nexthdr;
-   __u8  hdrlen;   /* This one is measured in 32 bit units! */
-   __be16 reserved;
-   __be32 spi;
-   __be32 seq_no;   /* Sequence number */
-   __u8  auth_data[0]; /* Length variable but >=4. Mind the 64 bit alignment! */
-};
-
-struct ipv6_esp_hdr {
-   __be32 spi;
-   __be32 seq_no;   /* Sequence number */
-   __u8  enc_data[0];  /* Length variable but >=8. Mind the 64 bit alignment! */
-};
-
-struct ipv6_comp_hdr {
-   __u8 nexthdr;
-   __u8 flags;
-   __be16 cpi;
-};
-
 /*
  * IPv6 fixed header
  *
diff --git a/include/net/ah.h b/include/net/ah.h
index 5e758c2..ae1c322 100644
--- a/include/net/ah.h
+++ b/include/net/ah.h
@@ -38,4 +38,11 @@ out:
return err;
 }
 
+struct ip_auth_hdr;
+
+static inline struct ip_auth_hdr *ip_auth_hdr(const struct sk_buff *skb)
+{
+   return (struct ip_auth_hdr *)skb_transport_header(skb);
+}
+
 #endif
diff --git a/include/net/esp.h b/include/net/esp.h
index e793d76..c1bc529 100644
--- a/include/net/esp.h
+++ b/include/net/esp.h
@@ -53,4 +53,11 @@ static inline int esp_mac_digest(struct esp_data *esp, struct sk_buff *skb,
return crypto_hash_final(&desc, esp->auth.work_icv);
 }
 
+struct ip_esp_hdr;
+
+static inline struct ip_esp_hdr *ip_esp_hdr(const struct sk_buff *skb)
+{
+   return (struct ip_esp_hdr *)skb_transport_header(skb);
+}
+
 #endif
diff --git a/include/net/ipcomp.h b/include/net/ipcomp.h
index 87c1af3..330b74e 100644
--- a/include/net/ipcomp.h
+++ b/include/net/ipcomp.h
@@ -1,14 +1,23 @@
 #ifndef _NET_IPCOMP_H
 #define _NET_IPCOMP_H
 
-#include 
 #include 
 
 #define IPCOMP_SCRATCH_SIZE 65400
 
+struct crypto_comp;
+
 struct ipcomp_data {
u16 threshold;
struct crypto_comp **tfms;
 };
 
+struct ip_comp_hdr;
+struct sk_buff;
+
+static inline struct ip_comp_hdr *ip_comp_hdr(const struct sk_buff *skb)
+{
+   return (struct ip_comp_hdr *)skb_transport_header(skb);
+}
+
 #endif
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index e4f7aa3..d697064 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -82,7 +82,7 @@ static int ah_output(struct xfrm_state *x, struct sk_buff *skb)
goto error;
}
 
-   ah = (struct ip_auth_hdr *)skb_transport_header(skb);
+   ah = ip_auth_hdr(skb);
ah->nexthdr = *skb_mac_header(skb);
*skb_mac_header(skb) = IPPROTO_AH;
 
@@ -93,8 +93,7 @@ static int ah_output(struct xfrm_state *x, struct sk_buff *skb)
top_iph->check = 0;
 
ahp = x->data;
-   ah->hdrlen  = (XFRM_ALIGN8(sizeof(struct ip_auth_hdr) +
-  ahp->icv_trunc_len) >> 2) - 2;
+   ah->hdrlen  = (XFRM_ALIGN8(sizeof(*ah) + ahp->icv_trunc_len) >> 2) - 2;
 
ah->reserved = 0;
ah->spi = x->id.spi;
@@ -134,15 +133,15 @@ static int ah_input(struct xfrm_state *x, struct sk_buff *skb)
struct ah_data *ahp;
char work_buf[60];
 
-   if (!pskb_may_pull(skb, sizeof(struct ip_auth_hdr)))
+   if (!pskb_may_pull(skb, sizeof(*ah)))
goto out;
 
-   ah = (struct ip_auth_hdr*)skb->data;
+   ah = (struct ip_auth_hdr *)skb->data;
ahp = x->data;
ah_hlen = (ah->hdrlen + 2) << 2;
 
-   if (ah_hlen != XFRM_ALIGN8(sizeof(struct ip_auth_hdr) + ahp->icv_full_len) &&
-   ah_hlen != XFRM_ALIGN8(sizeof(struct ip_auth_hdr) + ahp->icv_trunc_len))
+   if (ah_hlen != XFRM_ALIGN8(sizeof(*ah) + ahp->icv_full_len) &&
+   ah_hlen != XFRM_ALIGN8(sizeof(*ah) + ahp->icv_trunc_len))
goto out;
 
if (!pskb_may_pull(skb, ah_hlen))
@@ -156,7 +155,7 @@ static int ah_input(struct xfrm_state *x, struct sk_buff *skb)
 
skb->ip_summed = CHECKSUM_NONE;

[PATCH 6/7] [IPSEC]: Move IP length/checksum setting out of transforms

2007-10-10 Thread Herbert Xu

[IPSEC]: Move IP length/checksum setting out of transforms

This patch moves the setting of the IP length and checksum fields out of
the transforms and into the xfrmX_output functions.  This would help future
efforts in merging the transforms themselves.

It also adds an optimisation to ipcomp due to the fact that the transport
offset is guaranteed to be zero.

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
---

 net/ipv4/ah4.c   |2 --
 net/ipv4/esp4.c  |7 +--
 net/ipv4/ipcomp.c|   22 +-
 net/ipv4/xfrm4_mode_beet.c   |3 ---
 net/ipv4/xfrm4_mode_tunnel.c |5 +
 net/ipv4/xfrm4_output.c  |5 +
 net/ipv4/xfrm4_tunnel.c  |5 -
 net/ipv6/esp6.c  |3 ---
 net/ipv6/ipcomp6.c   |   19 ++-
 net/ipv6/mip6.c  |2 --
 net/ipv6/xfrm6_mode_beet.c   |2 --
 net/ipv6/xfrm6_mode_tunnel.c |4 +---
 net/ipv6/xfrm6_output.c  |4 
 net/ipv6/xfrm6_tunnel.c  |5 -
 14 files changed, 23 insertions(+), 65 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index d697064..60925fe 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -115,8 +115,6 @@ static int ah_output(struct xfrm_state *x, struct sk_buff *skb)
memcpy(top_iph+1, iph+1, top_iph->ihl*4 - sizeof(struct iphdr));
}
 
-   ip_send_check(top_iph);
-
err = 0;
 
 error:
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 66eb496..8377bed 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -16,7 +16,6 @@
 static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
 {
int err;
-   struct iphdr *top_iph;
struct ip_esp_hdr *esph;
struct crypto_blkcipher *tfm;
struct blkcipher_desc desc;
@@ -59,9 +58,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
pskb_put(skb, trailer, clen - skb->len);
 
skb_push(skb, -skb_network_offset(skb));
-   top_iph = ip_hdr(skb);
esph = ip_esp_hdr(skb);
-   top_iph->tot_len = htons(skb->len + alen);
*(skb_tail_pointer(trailer) - 1) = *skb_mac_header(skb);
*skb_mac_header(skb) = IPPROTO_ESP;
 
@@ -76,7 +73,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
uh = (struct udphdr *)esph;
uh->source = encap->encap_sport;
uh->dest = encap->encap_dport;
-   uh->len = htons(skb->len + alen - top_iph->ihl*4);
+   uh->len = htons(skb->len + alen - skb_transport_offset(skb));
uh->check = 0;
 
switch (encap->encap_type) {
@@ -136,8 +133,6 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
 unlock:
spin_unlock_bh(&x->lock);
 
-   ip_send_check(top_iph);
-
 error:
return err;
 }
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index 78d6ddb..32b02de 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -98,10 +98,9 @@ out:
 static int ipcomp_compress(struct xfrm_state *x, struct sk_buff *skb)
 {
struct ipcomp_data *ipcd = x->data;
-   const int ihlen = skb_transport_offset(skb);
-   const int plen = skb->len - ihlen;
+   const int plen = skb->len;
int dlen = IPCOMP_SCRATCH_SIZE;
-   u8 *start = skb_transport_header(skb);
+   u8 *start = skb->data;
const int cpu = get_cpu();
u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
@@ -118,7 +117,7 @@ static int ipcomp_compress(struct xfrm_state *x, struct sk_buff *skb)
memcpy(start + sizeof(struct ip_comp_hdr), scratch, dlen);
put_cpu();
 
-   pskb_trim(skb, ihlen + dlen + sizeof(struct ip_comp_hdr));
+   pskb_trim(skb, dlen + sizeof(struct ip_comp_hdr));
return 0;
 
 out:
@@ -131,13 +130,8 @@ static int ipcomp_output(struct xfrm_state *x, struct sk_buff *skb)
int err;
struct ip_comp_hdr *ipch;
struct ipcomp_data *ipcd = x->data;
-   int hdr_len = 0;
-   struct iphdr *iph = ip_hdr(skb);
 
-   skb_push(skb, -skb_network_offset(skb));
-   iph->tot_len = htons(skb->len);
-   hdr_len = iph->ihl * 4;
-   if ((skb->len - hdr_len) < ipcd->threshold) {
+   if (skb->len < ipcd->threshold) {
/* Don't bother compressing */
goto out_ok;
}
@@ -146,25 +140,19 @@ static int ipcomp_output(struct xfrm_state *x, struct sk_buff *skb)
goto out_ok;
 
err = ipcomp_compress(x, skb);
-   iph = ip_hdr(skb);
 
if (err) {
goto out_ok;
}
 
/* Install ipcomp header, convert into ipcomp datagram. */
-   iph->tot_len = htons(skb->len);
ipch = ip_comp_hdr(skb);
ipch->nexthdr = *skb_mac_header(skb);
ipch->flags = 0;
ipch->cpi = htons((u16 )ntohl(x->id.spi));
*skb_mac_header(skb) = IPPROTO_COMP;
-   ip_send_

[PATCH 7/7] [IPSEC]: Move IP protocol setting from transforms into xfrm4_input.c

2007-10-10 Thread Herbert Xu

[IPSEC]: Move IP protocol setting from transforms into xfrm4_input.c

This patch makes the IPv4 x->type->input functions return the next protocol
instead of setting it directly.  This is identical to how we do things in
IPv6 and will help us merge common code on the input path.

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
---

 net/ipv4/ah4.c  |5 +++--
 net/ipv4/esp4.c |3 +--
 net/ipv4/ipcomp.c   |7 ---
 net/ipv4/xfrm4_input.c  |7 ++-
 net/ipv4/xfrm4_tunnel.c |2 +-
 5 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 60925fe..4e8e3b0 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -125,6 +125,7 @@ static int ah_input(struct xfrm_state *x, struct sk_buff *skb)
 {
int ah_hlen;
int ihl;
+   int nexthdr;
int err = -EINVAL;
struct iphdr *iph;
struct ip_auth_hdr *ah;
@@ -136,6 +137,7 @@ static int ah_input(struct xfrm_state *x, struct sk_buff *skb)
 
ah = (struct ip_auth_hdr *)skb->data;
ahp = x->data;
+   nexthdr = ah->nexthdr;
ah_hlen = (ah->hdrlen + 2) << 2;
 
if (ah_hlen != XFRM_ALIGN8(sizeof(*ah) + ahp->icv_full_len) &&
@@ -182,13 +184,12 @@ static int ah_input(struct xfrm_state *x, struct sk_buff *skb)
goto out;
}
}
-   ((struct iphdr*)work_buf)->protocol = ah->nexthdr;
skb->network_header += ah_hlen;
memcpy(skb_network_header(skb), work_buf, ihl);
skb->transport_header = skb->network_header;
__skb_pull(skb, ah_hlen + ihl);
 
-   return 0;
+   return nexthdr;
 
 out:
return err;
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 8377bed..6b1a31a 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -257,12 +257,11 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
skb->ip_summed = CHECKSUM_UNNECESSARY;
}
 
-   iph->protocol = nexthdr[1];
pskb_trim(skb, skb->len - alen - padlen - 2);
__skb_pull(skb, sizeof(*esph) + esp->conf.ivlen);
skb_set_transport_header(skb, -ihl);
 
-   return 0;
+   return nexthdr[1];
 
 out:
return -EINVAL;
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index 32b02de..0bfeb02 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -75,7 +75,6 @@ out:
 static int ipcomp_input(struct xfrm_state *x, struct sk_buff *skb)
 {
int err = -ENOMEM;
-   struct iphdr *iph;
struct ip_comp_hdr *ipch;
 
if (skb_linearize_cow(skb))
@@ -84,12 +83,14 @@ static int ipcomp_input(struct xfrm_state *x, struct sk_buff *skb)
skb->ip_summed = CHECKSUM_NONE;
 
/* Remove ipcomp header and decompress original payload */
-   iph = ip_hdr(skb);
ipch = (void *)skb->data;
-   iph->protocol = ipch->nexthdr;
skb->transport_header = skb->network_header + sizeof(*ipch);
__skb_pull(skb, sizeof(*ipch));
err = ipcomp_decompress(x, skb);
+   if (err)
+   goto out;
+
+   err = ipch->nexthdr;
 
 out:
return err;
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index 2fa1082..e9bbfde 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -54,12 +54,14 @@ static int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type)
int xfrm_nr = 0;
int decaps = 0;
int err = xfrm4_parse_spi(skb, ip_hdr(skb)->protocol, &spi, &seq);
+   unsigned int nhoff = offsetof(struct iphdr, protocol);
 
if (err != 0)
goto drop;
 
do {
const struct iphdr *iph = ip_hdr(skb);
+   int nexthdr;
 
if (xfrm_nr == XFRM_MAX_DEPTH)
goto drop;
@@ -82,9 +84,12 @@ static int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type)
if (xfrm_state_check_expire(x))
goto drop_unlock;
 
-   if (x->type->input(x, skb))
+   nexthdr = x->type->input(x, skb);
+   if (nexthdr <= 0)
goto drop_unlock;
 
+   skb_network_header(skb)[nhoff] = nexthdr;
+
/* only the first xfrm gets the encap type */
encap_type = 0;
 
diff --git a/net/ipv4/xfrm4_tunnel.c b/net/ipv4/xfrm4_tunnel.c
index e1fafc1..1312417 100644
--- a/net/ipv4/xfrm4_tunnel.c
+++ b/net/ipv4/xfrm4_tunnel.c
@@ -18,7 +18,7 @@ static int ipip_output(struct xfrm_state *x, struct sk_buff *skb)
 
 static int ipip_xfrm_rcv(struct xfrm_state *x, struct sk_buff *skb)
 {
-   return 0;
+   return IPPROTO_IP;
 }
 
 static int ipip_init_state(struct xfrm_state *x)
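
The return convention introduced here can be sketched in userspace C. This is a simplified illustration under assumptions: struct toy_iphdr, toy_input and toy_rcv are hypothetical stand-ins, and only the "return the next protocol, let the caller store it" flow mirrors the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal stand-in for an IPv4 header: just enough to locate the protocol byte. */
struct toy_iphdr {
    unsigned char ver_ihl;
    unsigned char tos;
    unsigned short tot_len;
    unsigned short id;
    unsigned short frag_off;
    unsigned char ttl;
    unsigned char protocol;
};

/* Hypothetical transform type: returns the next protocol (> 0) or a negative errno. */
typedef int (*toy_input_fn)(const unsigned char *payload, size_t len);

/* Example transform whose first payload byte names the next protocol. */
int toy_input(const unsigned char *payload, size_t len)
{
    if (len < 1)
        return -EINVAL;
    return payload[0];
}

/*
 * Caller side: run the transform and, on success, store the returned
 * protocol into the header at offset nhoff. Returns 0, or -1 for a drop.
 */
int toy_rcv(unsigned char *hdr, size_t nhoff, toy_input_fn input,
            const unsigned char *payload, size_t len)
{
    int nexthdr = input(payload, len);

    if (nexthdr <= 0)
        return -1;
    hdr[nhoff] = (unsigned char)nexthdr;
    return 0;
}
```

With a check of this form a return value of 0 also takes the drop branch. IPPROTO_IP is 0 in the standard headers, so the ipip_xfrm_rcv hunk above may be worth double-checking against the `nexthdr <= 0` test.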

[PATCH 4/7] [IPSEC]: Use IPv6 calling convention as the convention for x->mode->output

2007-10-10 Thread Herbert Xu

[IPSEC]: Use IPv6 calling convention as the convention for x->mode->output

The IPv6 calling convention for x->mode->output is more general and could
help an eventual protocol-generic x->type->output implementation.  This
patch adopts it for IPv4 as well and modifies the IPv4 type output functions
accordingly.

It also rewrites the IPv6 mac/transport header calculation to be based off
the network header where practical.
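
To make the convention concrete, here is a minimal userspace sketch, not
the kernel code. It assumes a flat byte buffer; struct pkt, type_output
and the PROTO_* constants are illustrative names, with the two pointers
standing in for skb_mac_header() and skb_transport_header().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PROTO_TCP = 6, PROTO_AH = 51 };

/* Hypothetical stand-in for the two sk_buff offsets the convention uses. */
struct pkt {
	uint8_t *nexthdr_field;	/* role of skb_mac_header(skb) */
	uint8_t *encap_hdr;	/* role of skb_transport_header(skb) */
};

/*
 * What a type output function does under this convention: save the old
 * next-protocol value in the new header, then claim the field.  The
 * sketch assumes nexthdr is the first byte of the new header, as it is
 * for AH and IPComp (ESP stores it in the trailer instead).
 */
static void type_output(struct pkt *p, uint8_t proto)
{
	assert(p != NULL && p->nexthdr_field != NULL && p->encap_hdr != NULL);
	p->encap_hdr[0] = *p->nexthdr_field;
	*p->nexthdr_field = proto;
}
```

Because the caller only has to point nexthdr_field at the right byte, the
same function works whether that byte lives in the top IP header or in a
preceding extension header.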

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
---

 include/net/xfrm.h  |   12 
 net/ipv4/ah4.c  |6 +++---
 net/ipv4/esp4.c |   11 +--
 net/ipv4/ipcomp.c   |   10 +-
 net/ipv4/xfrm4_mode_beet.c  |   17 +++--
 net/ipv4/xfrm4_mode_transport.c |7 +++
 net/ipv4/xfrm4_mode_tunnel.c|7 +++
 net/ipv6/xfrm6_mode_beet.c  |9 +
 net/ipv6/xfrm6_mode_ro.c|9 +
 net/ipv6/xfrm6_mode_transport.c |9 +
 net/ipv6/xfrm6_mode_tunnel.c|   14 +++---
 11 files changed, 44 insertions(+), 67 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 1c116dc..77be396 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -300,6 +300,18 @@ extern void xfrm_put_type(struct xfrm_type *type);
 
 struct xfrm_mode {
int (*input)(struct xfrm_state *x, struct sk_buff *skb);
+
+   /*
+* Add encapsulation header.
+*
+* On exit, the transport header will be set to the start of the
+* encapsulation header to be filled in by x->type->output and
+* the mac header will be set to the nextheader (protocol for
+* IPv4) field of the extension header directly preceding the
+* encapsulation header, or in its absence, that of the top IP
+* header.  The value of the network header will always point
+* to the top IP header while skb->data will point to the payload.
+*/
int (*output)(struct xfrm_state *x,struct sk_buff *skb);
 
struct module *owner;
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index dbb1f11..e4f7aa3 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -82,14 +82,14 @@ static int ah_output(struct xfrm_state *x, struct sk_buff 
*skb)
goto error;
}
 
-   ah = (struct ip_auth_hdr *)((char *)top_iph+top_iph->ihl*4);
-   ah->nexthdr = top_iph->protocol;
+   ah = (struct ip_auth_hdr *)skb_transport_header(skb);
+   ah->nexthdr = *skb_mac_header(skb);
+   *skb_mac_header(skb) = IPPROTO_AH;
 
top_iph->tos = 0;
top_iph->tot_len = htons(skb->len);
top_iph->frag_off = 0;
top_iph->ttl = 0;
-   top_iph->protocol = IPPROTO_AH;
top_iph->check = 0;
 
ahp = x->data;
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 0f5e838..93153d1 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -60,10 +60,10 @@ static int esp_output(struct xfrm_state *x, struct sk_buff 
*skb)
 
skb_push(skb, -skb_network_offset(skb));
top_iph = ip_hdr(skb);
-   esph = (struct ip_esp_hdr *)(skb_network_header(skb) +
-top_iph->ihl * 4);
+   esph = (struct ip_esp_hdr *)skb_transport_header(skb);
top_iph->tot_len = htons(skb->len + alen);
-   *(skb_tail_pointer(trailer) - 1) = top_iph->protocol;
+   *(skb_tail_pointer(trailer) - 1) = *skb_mac_header(skb);
+   *skb_mac_header(skb) = IPPROTO_ESP;
 
spin_lock_bh(&x->lock);
 
@@ -91,9 +91,8 @@ static int esp_output(struct xfrm_state *x, struct sk_buff 
*skb)
break;
}
 
-   top_iph->protocol = IPPROTO_UDP;
-   } else
-   top_iph->protocol = IPPROTO_ESP;
+   *skb_mac_header(skb) = IPPROTO_UDP;
+   }
 
esph->spi = x->id.spi;
esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq);
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index 1929d45..bf74f64 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -98,10 +98,10 @@ out:
 static int ipcomp_compress(struct xfrm_state *x, struct sk_buff *skb)
 {
struct ipcomp_data *ipcd = x->data;
-   const int ihlen = ip_hdrlen(skb);
+   const int ihlen = skb_transport_offset(skb);
const int plen = skb->len - ihlen;
int dlen = IPCOMP_SCRATCH_SIZE;
-   u8 *start = skb->data + ihlen;
+   u8 *start = skb_transport_header(skb);
const int cpu = get_cpu();
u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
@@ -154,11 +154,11 @@ static int ipcomp_output(struct xfrm_state *x, struct 
sk_buff *skb)
 
/* Install ipcomp header, convert into ipcomp datagram. */
iph->tot_len = htons(skb->len);
-   ipch = (struct ip_comp_hdr *)((char *)iph + iph->ihl * 4);
-   ipch->nexthdr = iph->protocol;
+   ipch = (struct ip_comp_hdr *)skb_transport_header(skb);
+   ipch->nexthdr =

Re: [Devel] [PATCH 1/5] net: Modify all rtnetlink methods to only work in the initial namespace

2007-10-10 Thread Denis V. Lunev

Daniel Lezcano wrote:
> struct net *net = in?in->nd_net:out->nd_net;
> 
>> So, we are bound to the following options:
>> - add more non-uniform hacks to place 'struct net' into more and more
>>   structures like xt_target
>> - add a 7th parameter here and elsewhere
>> - introduce an skb_net field in the 'struct sk_buff' making all code
>>   uniform, at least when we have an skb
>>
>> I think that this is not the last place with such a parameter list and
>> we should make a decision at this point, while the code is not mainline
>> yet.
>>
>> As far as I understand, the netfilter code is not touched by Eric and
>> we can face some non-trivial problems there.
> 
> In Eric's git tree:
> http://git.kernel.org/?p=linux/kernel/git/ebiederm/linux-2.6-netns.git
> 
> There are some modifications concerning
> net/ipv4/netfilter/iptable_filter.c and at the ipt_hook function, there is:
> 
> struct net *net = (in?in:out)->nd_net;
> 
>> So, if my point about uniformity is valid, this patchset looks wrong and
>> should be re-worked :(
> 
> As Eric said, we want to build the network namespace step by step,
> taking care of not breaking the init network namespace.
> 
> If you want to make iptables per namespace or catch problems before the
> code goes to Dave's tree, IMHO it will be more convenient to post to
> containers@ the patches against netns49, where the modifications will be
> in a network namespace big picture.
> 

My point is somewhat different. Yes, this is enough for that place. But
if we go this way, I must scatter these checks all around the netfilter
code. Brr.

In the forward chain the situation is different for Layer 3 switching.
Let's assume that we have an OpenVZ scheme, where the packet flows from
socket to device and after that from device to device via the forwarding
path. You can't call skb_orphan on namespace switching as this breaks UDP
flow regulation: the virtual network device is fast while real Ethernet
is slow, so packets would be dropped on the queue of the real device. So
a packet on the send path with a socket from another namespace is
possible :(

I just pray for uniformity, so that we can concentrate on the code rather
than on guessing which path we are on :(

Regards,
Den

Re: [PATCH] Evict tmp variable from the stack in ip6_evictor

2007-10-10 Thread Pavel Emelyanov

Patrick McHardy wrote:
> Pavel Emelyanov wrote:
>> The list_head *tmp is used to help getting the first entry in
>> the ip6_frag_lru_list list. There is a simpler way to do it
> 
> 
> The exact same code exists in ip_fragment.c and nf_conntrack_reasm.c,
> please also change it there.

Hm, indeed. But I see that the structs frag_queue in reassembly.c, ipq 
in ip_fragment.c and nf_ct_frag6_queue in the nf code look VERY similar, 
and much of the code (like link/unlink or evict) looks the same too.

Maybe it's worth creating something like struct skb_fragment and
consolidating all the common stuff into some net/core/lib_frag.c? Or
is there some hidden reason for keeping this code split?

Thanks, 
Pavel

[PATCH] Evict tmp variable from the stack in ip6_evictor

2007-10-10 Thread Pavel Emelyanov

The list_head *tmp is only used to get the first entry in the
ip6_frag_lru_list list. There is a simpler way to do it.

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 31601c9..8fad98b 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -261,7 +261,6 @@ static __inline__ void fq_kill(struct fr
 static void ip6_evictor(struct inet6_dev *idev)
 {
struct frag_queue *fq;
-   struct list_head *tmp;
int work;
 
work = atomic_read(&ip6_frag_mem) - sysctl_ip6frag_low_thresh;
@@ -274,8 +273,9 @@ static void ip6_evictor(struct inet6_dev
read_unlock(&ip6_frag_lock);
return;
}
-   tmp = ip6_frag_lru_list.next;
-   fq = list_entry(tmp, struct frag_queue, lru_list);
+
+   fq = list_first_entry(&ip6_frag_lru_list,
+   struct frag_queue, lru_list);
atomic_inc(&fq->refcnt);
read_unlock(&ip6_frag_lock);
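
For readers without the kernel tree at hand, the helper can be sketched
in userspace as follows. This is a simplified illustration, not the
kernel's list.h: the macros are cut-down copies, and frag_queue_demo,
list_init, list_add_tail_demo and oldest are made-up names.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Map a pointer to an embedded list_head back to its containing struct. */
#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* First element of a non-empty list; no temporary pointer needed. */
#define list_first_entry(head, type, member) \
	list_entry((head)->next, type, member)

struct frag_queue_demo {
	int id;
	struct list_head lru_list;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail_demo(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Oldest queue on the LRU list; the caller must ensure it is not empty. */
static struct frag_queue_demo *oldest(struct list_head *lru)
{
	assert(lru->next != lru);
	return list_first_entry(lru, struct frag_queue_demo, lru_list);
}
```

As in the patch, the emptiness check has to happen before the lookup,
since on an empty list head->next points back at the head itself.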
 


Re: [PATCH] division-by-zero in inet_csk_get_port

2007-10-10 Thread Brian Haley


Anton Arapov wrote:
>   So, now the way suggested by Denis looks reasonable.
>
>   What do you think?

If that's the case then you should fix __udp_lib_get_port() the same way.

Prevent division by zero in __udp_lib_get_port() when only one 
unsecured port is available.


-Brian


Signed-off-by: Brian Haley <[EMAIL PROTECTED]>

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ef4d901..61faa38 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -150,10 +150,11 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 		int i;
 		int low = sysctl_local_port_range[0];
 		int high = sysctl_local_port_range[1];
+		int remaining = (high - low) + 1;
 		unsigned rover, best, best_size_so_far;
 
 		best_size_so_far = UINT_MAX;
-		best = rover = net_random() % (high - low) + low;
+		best = rover = net_random() % remaining + low;
 
 		/* 1st pass: look for empty (or shortest) hash chain */
 		for (i = 0; i < UDP_HTABLE_SIZE; i++) {
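
The failure mode is easy to reproduce outside the kernel. The following
is a minimal userspace sketch, not the kernel code: pick_port and its rnd
parameter are illustrative names standing in for the net_random() based
expression in the hunk above.

```c
#include <assert.h>

/*
 * Map a random value onto the inclusive port range [low, high].
 * Illustrative helper only.
 */
static int pick_port(unsigned int rnd, int low, int high)
{
	unsigned int remaining;

	assert(low <= high);
	/* Number of candidate ports; at least 1 whenever low <= high. */
	remaining = (unsigned int)(high - low) + 1;

	/* The old form, rnd % (high - low), divides by zero if low == high. */
	return (int)(rnd % remaining) + low;
}
```

With low == high the only possible result is that single port, and the
upper bound becomes reachable, which the old modulus never allowed.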

[IPv6] Update setsockopt(IPV6_MULTICAST_IF) to support RFC 3493

2007-10-10 Thread Brian Haley


Hi,

From RFC 3493, Section 5.2:

  IPV6_MULTICAST_IF

 Set the interface to use for outgoing multicast packets.  The
 argument is the index of the interface to use.  If the
 interface index is specified as zero, the system selects the
 interface (for example, by looking up the address in a routing
 table and using the resulting interface).

This patch adds support for (index == 0) to reset the value to its 
original state, allowing the system to choose the best interface.  IPv4 
already behaves this way.


-Brian


Signed-off-by: Brian Haley <[EMAIL PROTECTED]>
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 532425d..309284e 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -539,6 +539,13 @@ done:
 	case IPV6_MULTICAST_IF:
 		if (sk->sk_type == SOCK_STREAM)
 			goto e_inval;
+
+		if (val == 0) {
+			np->mcast_oif = 0;
+			retv = 0;
+			break;
+		}
+
 		if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != val)
 			goto e_inval;
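
From userspace, the reset would look roughly like this. A minimal sketch
under stated assumptions: reset_multicast_if is a made-up helper name,
and the unsigned int argument type follows the RFC 3493 text quoted
above.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Ask the system to choose the outgoing multicast interface again by
 * passing interface index 0.  Returns 0 on success, -1 with errno set
 * on failure.
 */
static int reset_multicast_if(int fd)
{
	unsigned int ifindex = 0;	/* 0 means "let the system select" */

	return setsockopt(fd, IPPROTO_IPV6, IPV6_MULTICAST_IF,
			  &ifindex, sizeof(ifindex));
}
```

A caller would typically use this on an AF_INET6 datagram socket after
having pinned the interface earlier; the hunk above still rejects
SOCK_STREAM sockets before the new check is reached.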

Re: [PATCH] Evict tmp variable from the stack in ip6_evictor

2007-10-10 Thread Patrick McHardy


Pavel Emelyanov wrote:
> Patrick McHardy wrote:
>> Pavel Emelyanov wrote:
>>> The list_head *tmp is used to help getting the first entry in
>>> the ip6_frag_lru_list list. There is a simpler way to do it
>>
>> The exact same code exists in ip_fragment.c and nf_conntrack_reasm.c,
>> please also change it there.
>
> Hm, indeed. But I see that the structs frag_queue in reassembly.c, ipq 
> in ip_fragment.c and nf_ct_frag6_queue in nf code looks VERY similar 
> and very much of code (like link/unlink or evict) looks the same too.
>
> Maybe it's worth creating something like struct skb_fragment and
> consolidate all the common stuff into some net/core/lib_frag.c? Or
> is there some hidden reason for keeping this code splitted?

I'm not sure if it's possible between IPv4 and IPv6, but sharing code
between IPv6 reassembly and netfilter/ipv6 would be nice.

RE: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Waskiewicz Jr, Peter P

> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Wed, 10 Oct 2007 12:23:31 +0200
> 
> > On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote:
> > > The chip I was working with at the time (UltraSPARC-IIi) 
> compressed 
> > > all the linear stores into 64-byte full cacheline 
> transactions via 
> > > the store buffer.
> > 
> > That's a pretty old CPU. Conclusions on more modern ones 
> might be different.
> 
> Cache matters, just scale the numbers.
> 
> > I suppose it would be an interesting experiment at least.
> 
> Absolutely.
> 
> I've always gotten very poor results when increasing the TX 
> queue a lot, for example with NIU the point of diminishing 
> returns seems to be in the range of 256-512 TX descriptor 
> entries and this was with 1.6Ghz cpus.

We've done similar testing with ixgbe to push maximum descriptor counts,
and we lost performance very quickly in the same range you're quoting on
NIU.

Cheers,
-PJ Waskiewicz

Re: [PATCH] do not give access to 1-1024 ports for autobinding

2007-10-10 Thread Stephen Hemminger

On Wed, 10 Oct 2007 18:34:49 +0400
"Denis V. Lunev" <[EMAIL PROTECTED]> wrote:

> This patch prevents possibility to give 1-1024 port range for autobinding.
> {1, 1} may only takes some sense for deep embedded people.
> 
> Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
> 
> --- ./net/ipv4/sysctl_net_ipv4.c.port2	2007-10-10 17:46:48.0 +0400
> +++ ./net/ipv4/sysctl_net_ipv4.c  2007-10-10 18:08:00.0 +0400
> @@ -25,7 +25,7 @@ extern int sysctl_ip_nonlocal_bind;
>  #ifdef CONFIG_SYSCTL
>  static int zero;
>  static int tcp_retr1_max = 255;
> -static int ip_local_port_range_min[] = { 1, 1 };
> +static int ip_local_port_range_min[] = { 1024, 1024 };
>  static int ip_local_port_range_max[] = { 65535, 65535 };
>  #endif
>  
> -

That only limits the sysctl, which seems completely counter productive.
Sounds like more of the "stop root from shooting themselves" patches.

-- 
Stephen Hemminger <[EMAIL PROTECTED]>

Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Andi Kleen

> We've done similar testing with ixgbe to push maximum descriptor counts,
> and we lost performance very quickly in the same range you're quoting on
> NIU.

Did you try it with WC writes to the ring or CLFLUSH?

-Andi

Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Bill Fink

On Tue, 09 Oct 2007, David Miller wrote:

> From: jamal <[EMAIL PROTECTED]>
> Date: Tue, 09 Oct 2007 17:56:46 -0400
> 
> > if the h/ware queues are full because of link pressure etc, you drop. We
> > drop today when the s/ware queues are full. The driver txmit lock takes
> > place of the qdisc queue lock etc. I am assuming there is still need for
> > that locking. The filter/classification scheme still works as is and
> > select classes which map to rings. tc still works as is etc.
> 
> I understand your suggestion.
> 
> We have to keep in mind, however, that the sw queue right now is 1000
> packets.  I heavily discourage any driver author to try and use any
> single TX queue of that size.  Which means that just dropping on back
> pressure might not work so well.
> 
> Or it might be perfect and signal TCP to backoff, who knows! :-)

I can't remember the details anymore, but for 10-GigE, I have encountered
cases where I was able to significantly increase TCP performance by
increasing the txqueuelen to 1, which is the setting I now use for
any 10-GigE testing.

-Bill

Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue

2007-10-10 Thread Moni Shoua

Jay Vosburgh wrote:
> David Miller <[EMAIL PROTECTED]> wrote:
> 
>> From: Jeff Garzik <[EMAIL PROTECTED]>
>> Date: Tue, 09 Oct 2007 20:56:35 -0400
>>
>>> Jeff Garzik wrote:
>>>> applied patches 1-9
>>>>
>>>> the only thing that was a hiccup during submission is that your email 
>>>> subject lines did not contain a notion of ordering "[PATCH 1/9] ...". 
>>>> But other than that, the git-send-email went flawlessly.
>>> unfortunately it does not seem to build flawlessly:
>> Yeah it doesn't handle Stephen Hemmingers headerops change
>> in net-2.6.24
> 
>   Gaah.  I'll sort it out and repost.
> 
>   -J
> 
> ---
>   -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

Hi Jay, Jeff,
Thanks for the help with making the patch compile under 2.6.24.
However, patch #3 is missing a line in bond_setup_by_slave that should look 
like this:

bond_dev->header_ops = slave_dev->header_ops;

I rewrote the patch and also fixed patch #8, which had become broken.

I would send the new patches now, but there is more: I also tested the code
in the 2.6.24 branch and found a problem. ifconfig down doesn't return (for
IPoIB interfaces) and is stuck in napi_disable() in the kernel (any idea why?).

I am trying to solve it now, so I'd like to wait a short time before these
patches are applied. I guess that I'll need to add something.

thanks
   MoniS


[PATCH] natsemi: Check return value for pci_enable_device()

2007-10-10 Thread Mark Brown

pci_enable_device() is __must_check so do that in natsemi_resume().

Signed-off-by: Mark Brown <[EMAIL PROTECTED]>
---
 drivers/net/natsemi.c |   10 --
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/natsemi.c b/drivers/net/natsemi.c
index b881786..50e1ec6 100644
--- a/drivers/net/natsemi.c
+++ b/drivers/net/natsemi.c
@@ -3314,13 +3314,19 @@ static int natsemi_resume (struct pci_dev *pdev)
 {
struct net_device *dev = pci_get_drvdata (pdev);
struct netdev_private *np = netdev_priv(dev);
+   int ret = 0;
 
rtnl_lock();
if (netif_device_present(dev))
goto out;
if (netif_running(dev)) {
BUG_ON(!np->hands_off);
-   pci_enable_device(pdev);
+   ret = pci_enable_device(pdev);
+   if (ret < 0) {
+   dev_err(&pdev->dev,
+   "pci_enable_device() failed: %d\n", ret);
+   goto out;
+   }
/*  pci_power_on(pdev); */
 
napi_enable(&np->napi);
@@ -3340,7 +3346,7 @@ static int natsemi_resume (struct pci_dev *pdev)
netif_device_attach(dev);
 out:
rtnl_unlock();
-   return 0;
+   return ret;
 }
 
 #endif /* CONFIG_PM */
-- 
1.5.3.4


RE: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Waskiewicz Jr, Peter P

> -Original Message-
> From: Andi Kleen [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, October 10, 2007 9:02 AM
> To: Waskiewicz Jr, Peter P
> Cc: David Miller; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> netdev@vger.kernel.org; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net 
> core use batching
> 
> > We've done similar testing with ixgbe to push maximum descriptor 
> > counts, and we lost performance very quickly in the same 
> range you're 
> > quoting on NIU.
> 
> Did you try it with WC writes to the ring or CLFLUSH?
> 
> -Andi

Hmm, I think it might be slightly different, but it still shows queue
depth vs. performance.  I was actually referring to how many descriptors
we can represent a packet with before it becomes a problem wrt
performance.  This morning I tried to actually push my ixgbe NIC hard
enough to come close to filling the ring with packets (384-byte
packets), and even on my 8-core Xeon I can't do it.  My system can't
generate enough I/O to fill the hardware queues before CPUs max out.

-PJ Waskiewicz

Re: [PATCH] do not give access to 1-1024 ports for autobinding

2007-10-10 Thread Denis V. Lunev

Stephen Hemminger wrote:
> On Wed, 10 Oct 2007 18:34:49 +0400
> "Denis V. Lunev" <[EMAIL PROTECTED]> wrote:
> 
>> This patch prevents possibility to give 1-1024 port range for autobinding.
>> {1, 1} may only takes some sense for deep embedded people.
>>
>> Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
>>
>> --- ./net/ipv4/sysctl_net_ipv4.c.port2   2007-10-10 17:46:48.0 
>> +0400
>> +++ ./net/ipv4/sysctl_net_ipv4.c 2007-10-10 18:08:00.0 +0400
>> @@ -25,7 +25,7 @@ extern int sysctl_ip_nonlocal_bind;
>>  #ifdef CONFIG_SYSCTL
>>  static int zero;
>>  static int tcp_retr1_max = 255;
>> -static int ip_local_port_range_min[] = { 1, 1 };
>> +static int ip_local_port_range_min[] = { 1024, 1024 };
>>  static int ip_local_port_range_max[] = { 65535, 65535 };
>>  #endif
>>  
>> -
> 
> That only limits the sysctl, which seems completely counter productive.
> Sounds like more of the "stop root from shooting themselves" patches.
> 

They make sense for the case of multiple network namespaces, where root
in another namespace can be treated as a user of the initial namespace.

[RFC] more robust inet range checking

2007-10-10 Thread Stephen Hemminger

More complete version of local port range checking.

1. Enforce that low < high when setting.
2. Use seqlock to ensure atomic update.
3. Add port randomization to SCTP. This is a new feature but
   easier than maintaining old code that was broken if range
   changed.
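
Points 1 and 2 can be sketched together in userspace. This is only an
illustration of the idea, not the kernel seqlock API: it uses C11 atomics,
assumes a single writer (or writers serialized elsewhere), and the names
port_range, range_init, range_set and range_get are made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Illustrative single-writer sequence lock around a port range. */
struct port_range {
	atomic_uint seq;	/* odd while an update is in progress */
	atomic_int low;
	atomic_int high;
};

static void range_init(struct port_range *r, int low, int high)
{
	assert(low < high);
	atomic_init(&r->seq, 0);
	atomic_init(&r->low, low);
	atomic_init(&r->high, high);
}

/* Writer: reject low >= high, otherwise publish both values together. */
static int range_set(struct port_range *r, int low, int high)
{
	unsigned int seq;

	if (low >= high)
		return -EINVAL;

	seq = atomic_load(&r->seq);
	atomic_store(&r->seq, seq + 1);	/* mark update in progress */
	atomic_store(&r->low, low);
	atomic_store(&r->high, high);
	atomic_store(&r->seq, seq + 2);	/* even again: update complete */
	return 0;
}

/* Reader: retry until both values come from the same update. */
static void range_get(struct port_range *r, int *low, int *high)
{
	unsigned int seq;

	do {
		seq = atomic_load(&r->seq);
		*low = atomic_load(&r->low);
		*high = atomic_load(&r->high);
	} while ((seq & 1) || seq != atomic_load(&r->seq));
}
```

The point of the retry loop is that a reader can never see the low bound
of one setting paired with the high bound of another.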

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>


---
 drivers/infiniband/core/cma.c   |   24 ++--
 include/net/ip.h|3 +
 net/ipv4/inet_connection_sock.c |   26 ++---
 net/ipv4/inet_hashtables.c  |   13 +++---
 net/ipv4/sysctl_net_ipv4.c  |   77 
 net/ipv4/tcp_ipv4.c |1 
 net/ipv4/udp.c  |   18 -
 net/ipv6/inet6_hashtables.c |   13 +++---
 net/sctp/protocol.c |1 
 net/sctp/socket.c   |   26 -
 security/selinux/hooks.c|   37 ++-
 11 files changed, 157 insertions(+), 82 deletions(-)

--- a/include/net/ip.h  2007-10-10 08:26:57.0 -0700
+++ b/include/net/ip.h  2007-10-10 09:35:26.0 -0700
@@ -171,7 +171,8 @@ extern unsigned long snmp_fold_field(voi
 extern int snmp_mib_init(void *ptr[2], size_t mibsize, size_t mibalign);
 extern void snmp_mib_free(void *ptr[2]);
 
-extern int sysctl_local_port_range[2];
+extern void inet_get_local_port_range(int range[2]);
+
 extern int sysctl_ip_default_ttl;
 extern int sysctl_ip_nonlocal_bind;
 
--- a/net/ipv4/inet_connection_sock.c   2007-10-10 09:29:03.0 -0700
+++ b/net/ipv4/inet_connection_sock.c   2007-10-10 09:52:49.0 -0700
@@ -33,6 +33,19 @@ EXPORT_SYMBOL(inet_csk_timer_bug_msg);
  * This array holds the first and last local port number.
  */
 int sysctl_local_port_range[2] = { 32768, 61000 };
+DEFINE_SEQLOCK(sysctl_port_range_lock);
+
+void inet_get_local_port_range(int range[2])
+{
+   unsigned seq;
+   do {
+   seq = read_seqbegin(&sysctl_port_range_lock);
+
+   range[0] = sysctl_local_port_range[0];
+   range[1] = sysctl_local_port_range[1];
+   } while (read_seqretry(&sysctl_port_range_lock, seq));
+}
+EXPORT_SYMBOL(inet_get_local_port_range);
 
 int inet_csk_bind_conflict(const struct sock *sk,
   const struct inet_bind_bucket *tb)
@@ -77,10 +90,11 @@ int inet_csk_get_port(struct inet_hashin
 
local_bh_disable();
if (!snum) {
-   int low = sysctl_local_port_range[0];
-   int high = sysctl_local_port_range[1];
-   int remaining = (high - low) + 1;
-   int rover = net_random() % (high - low) + low;
+   int remaining, range[2], rover;
+
+   inet_get_local_port_range(range);
+   remaining = range[1] - range[0];
+   rover = net_random() % (range[1] - range[0]) + range[0];
 
do {
head = &hashinfo->bhash[inet_bhashfn(rover, 
hashinfo->bhash_size)];
@@ -91,8 +105,8 @@ int inet_csk_get_port(struct inet_hashin
break;
next:
spin_unlock(&head->lock);
-   if (++rover > high)
-   rover = low;
+   if (++rover > range[1])
+   rover = range[0];
} while (--remaining > 0);
 
/* Exhausted local port range during search?  It is not
--- a/net/ipv4/inet_hashtables.c2007-10-10 09:27:02.0 -0700
+++ b/net/ipv4/inet_hashtables.c2007-10-10 09:40:39.0 -0700
@@ -279,19 +279,18 @@ int inet_hash_connect(struct inet_timewa
int ret;
 
if (!snum) {
-   int low = sysctl_local_port_range[0];
-   int high = sysctl_local_port_range[1];
-   int range = high - low;
-   int i;
-   int port;
+   int i, count, range[2], port;
static u32 hint;
u32 offset = hint + inet_sk_port_offset(sk);
struct hlist_node *node;
struct inet_timewait_sock *tw = NULL;
 
+   inet_get_local_port_range(range);
+   count = range[1] - range[0];
+
local_bh_disable();
-   for (i = 1; i <= range; i++) {
-   port = low + (i + offset) % range;
+   for (i = 1; i <= count; i++) {
+   port = range[0] + (i + offset) % count;
head = &hinfo->bhash[inet_bhashfn(port, 
hinfo->bhash_size)];
spin_lock(&head->lock);
 
--- a/net/ipv4/sysctl_net_ipv4.c2007-10-10 08:27:00.0 -0700
+++ b/net/ipv4/sysctl_net_ipv4.c2007-10-10 09:46:12.0 -0700
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -25,8 +26,6 @@ extern int sysctl_ip_nonlocal_bind;
 #ifdef CONFIG_SYSCTL
 static int zero;
 static int tcp_retr1_max = 255;
-static int ip_local_port_range_min[]

Re: [PATCH] do not give access to 1-1024 ports for autobinding

2007-10-10 Thread Stephen Hemminger

On Wed, 10 Oct 2007 20:59:13 +0400
"Denis V. Lunev" <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger wrote:
> > On Wed, 10 Oct 2007 18:34:49 +0400
> > "Denis V. Lunev" <[EMAIL PROTECTED]> wrote:
> > 
> >> This patch prevents possibility to give 1-1024 port range for autobinding.
> >> {1, 1} may only takes some sense for deep embedded people.
> >>
> >> Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
> >>
> >> --- ./net/ipv4/sysctl_net_ipv4.c.port2 2007-10-10 17:46:48.0 
> >> +0400
> >> +++ ./net/ipv4/sysctl_net_ipv4.c   2007-10-10 18:08:00.0 +0400
> >> @@ -25,7 +25,7 @@ extern int sysctl_ip_nonlocal_bind;
> >>  #ifdef CONFIG_SYSCTL
> >>  static int zero;
> >>  static int tcp_retr1_max = 255;
> >> -static int ip_local_port_range_min[] = { 1, 1 };
> >> +static int ip_local_port_range_min[] = { 1024, 1024 };
> >>  static int ip_local_port_range_max[] = { 65535, 65535 };
> >>  #endif
> >>  
> >> -
> > 
> > That only limits the sysctl, which seems completely counter productive.
> > Sounds like more of the "stop root from shooting themselves" patches.
> > 
> 
> They have sense for the case of multiple network namespaces, where root
> in the other namespace can be treated as a user to initial namespace.

IMHO we don't want to treat root as a complete idiot like normal users.
As long as what root requests doesn't create a security problem, it
should be allowed.  The port space is per namespace right? The sysctl
values should be per namespace as well.

-- 
Stephen Hemminger <[EMAIL PROTECTED]>

Re: net-2.6.24 rebased...

2007-10-10 Thread Oliver Hartkopp

David Miller wrote:
> From: Urs Thuermann <[EMAIL PROTECTED]>
> Date: 09 Oct 2007 23:13:42 +0200
>
>> Last week I have sent another version of our patch series for PF_CAN.
>> The changes after the last review feedback were only cosmetics.
>>
>> Do you have any plans with that code for this or the next release?
>
> I think PF_CAN will go in 2.6.25

Good news. Thanks!

I'll keep on tracking the current patch flow to be sure that we're still 
on the head of development, when net-2.6.25 hits the ground.

Best regards,
Oliver

Re: [RFC] more robust inet range checking

2007-10-10 Thread Vlad Yasevich

Stephen Hemminger wrote:
> More complete version of local port range checking.
> 
> 1. Enforce that low < high when setting.
> 2. Use seqlock to ensure atomic update.
> 3. Add port randomization to SCTP. This is a new feature but
>easier than maintaining old code that was broken if range
>changed.
> 
> Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
> 

Ack the SCTP portion.  Much nicer and a much needed improvement.

Thanks
-vlad

authenc compile warnings in current net-2.6.24

2007-10-10 Thread Oliver Hartkopp


Hi Herbert,

CC [M] crypto/authenc.o
crypto/authenc.c: In function ‘crypto_authenc_hash’:
crypto/authenc.c:88: warning: ‘cryptlen’ may be used uninitialized in 
this function
crypto/authenc.c:87: warning: ‘dst’ may be used uninitialized in this 
function

crypto/authenc.c: In function ‘crypto_authenc_decrypt’:
crypto/authenc.c:163: warning: ‘cryptlen’ may be used uninitialized in 
this function

crypto/authenc.c:163: note: ‘cryptlen’ was declared here
crypto/authenc.c:162: warning: ‘src’ may be used uninitialized in this 
function

crypto/authenc.c:162: note: ‘src’ was declared here

Do you already know about these warnings?

Oliver

Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-10 Thread Eric W. Biederman

Pavel Emelyanov <[EMAIL PROTECTED]> writes:

>> I know there are several data structures internal to the kernel that
>> are indexed by ifindex, and not struct net_device *.  There is the
>> iflink field in struct net_device.  We need a way to refer to network
>> devices in other namespaces in rtnetlink in an unambiguous way.   I
>> don't see any real problems with a global ifindex assignment until
>> we start migrating applications.
>> 
>> So please hold off on this until the kernel has been audited and
>> we have removed all of the uses of ifindex that assume ifindex is
>> global, that we can find.
>
> Ok.

Thanks.

Eric

Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue

2007-10-10 Thread Roland Dreier

 > I also ran a test for the code in the branch of 2.6.24 and found a problem.
 > I see that ifconfig down doesn't return (for IPoIB interfaces) and it's 
 > stuck in napi_disable() in the kernel (any idea why?)

For what it's worth, I took the upstream 2.6.23 git tree and merged in
Dave's latest net-2.6.24 tree and my latest for-2.6.24 tree and tried
that.  I brought up an IPoIB interface, sent a few pings, and did
ifconfig down, and it worked fine.

Can you try the same thing without the bonding patches to see if your
setup works OK too?

Also can you give more details about what you do to get ifconfig down stuck?

 - R.

Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-10 Thread Johannes Berg

On Tue, 2007-10-09 at 11:41 -0600, Eric W. Biederman wrote:

> So please hold off on this until the kernel has been audited and
> we have removed all of the uses of ifindex that assume ifindex is
> global, that we can find.

I certainly have this assumption in the wireless code (cfg80211). How
would I go about removing it? Are netlink sockets per-namespace so I can
use the namespace of the netlink socket to look up a netdev?

johannes


Re: [RFC] more robust inet range checking

2007-10-10 Thread Brian Haley


Stephen Hemminger wrote:

 int inet_csk_bind_conflict(const struct sock *sk,
   const struct inet_bind_bucket *tb)
@@ -77,10 +90,11 @@ int inet_csk_get_port(struct inet_hashin
 
 	local_bh_disable();

if (!snum) {
-   int low = sysctl_local_port_range[0];
-   int high = sysctl_local_port_range[1];
-   int remaining = (high - low) + 1;
-   int rover = net_random() % (high - low) + low;
+   int remaining, range[2], rover;
+
+   inet_get_local_port_range(range);
+   remaining = range[1] - range[0];
+   rover = net_random() % (range[1] - range[0]) + range[0];


nit-pick:
rover = net_random() % remaining + range[0];
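
To illustrate the point of the nit: the modulus should be the number of candidate ports, and it must never be zero, even when the low and high ends of the range are equal. A minimal user-space sketch, assuming an inclusive [low, high] range and a caller-supplied random value; pick_rover() is a hypothetical helper, not a kernel function:

```c
#include <assert.h>

/* Hypothetical helper: pick a starting port in the inclusive range
 * [low, high] from a caller-supplied random value.  Using the port
 * count (high - low + 1) as the modulus keeps it non-zero even when
 * low == high. */
static int pick_rover(unsigned int rnd, int low, int high)
{
	int remaining;

	assert(low <= high);
	remaining = (high - low) + 1;
	return (int)(rnd % (unsigned int)remaining) + low;
}
```

With low == high the only possible result is that single port, so a one-port range cannot trigger a division by zero in this form.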


--- a/net/ipv4/udp.c   2007-10-10 08:27:00.0 -0700
+++ b/net/ipv4/udp.c   2007-10-10 09:44:35.0 -0700
@@ -147,13 +147,13 @@ int __udp_lib_get_port(struct sock *sk, 
 	write_lock_bh(&udp_hash_lock);
 
 	if (!snum) {

-   int i;
-   int low = sysctl_local_port_range[0];
-   int high = sysctl_local_port_range[1];
+   int i, range[2];
unsigned rover, best, best_size_so_far;


Should these be signed ints?  They're the only ones that are unsigned, 
but I don't know why.



--- a/net/sctp/protocol.c   2007-10-10 08:27:00.0 -0700
+++ b/net/sctp/protocol.c   2007-10-10 09:58:21.0 -0700
@@ -1173,7 +1173,6 @@ SCTP_STATIC __init int sctp_init(void)
}
 
 	spin_lock_init(&sctp_port_alloc_lock);

-   sctp_port_rover = sysctl_local_port_range[0] - 1;


I think you can remove the port_rover definition in sctp/structs.h and 
also the lock that protects it.  Patch below for that which can be 
applied on-top of yours.


-Brian


Remove SCTP port_rover and port_alloc_lock as they're no longer required.

Signed-off-by: Brian Haley <[EMAIL PROTECTED]>

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 448f713..c1a083c 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -197,8 +197,6 @@ extern struct sctp_globals {
 
 	/* This is the sctp port control hash.	*/
 	int port_hashsize;
-	int port_rover;
-	spinlock_t port_alloc_lock;  /* Protects port_rover. */
 	struct sctp_bind_hashbucket *port_hashtable;
 
 	/* This is the global local address list.
@@ -245,8 +243,6 @@ extern struct sctp_globals {
 #define sctp_assoc_hashsize		(sctp_globals.assoc_hashsize)
 #define sctp_assoc_hashtable		(sctp_globals.assoc_hashtable)
 #define sctp_port_hashsize		(sctp_globals.port_hashsize)
-#define sctp_port_rover			(sctp_globals.port_rover)
-#define sctp_port_alloc_lock		(sctp_globals.port_alloc_lock)
 #define sctp_port_hashtable		(sctp_globals.port_hashtable)
 #define sctp_local_addr_list		(sctp_globals.local_addr_list)
 #define sctp_local_addr_lock		(sctp_globals.addr_list_lock)
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 80df457..81b26c5 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1172,8 +1172,6 @@ SCTP_STATIC __init int sctp_init(void)
 		sctp_port_hashtable[i].chain = NULL;
 	}
 
-	spin_lock_init(&sctp_port_alloc_lock);
-
 	printk(KERN_INFO "SCTP: Hash tables configured "
 			 "(established %d bind %d)\n",
 		sctp_assoc_hashsize, sctp_port_hashsize);
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index e1e2d2c..293200d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -5321,7 +5321,6 @@ static long sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 		remaining = range[1] - range[0];
 		rover = net_random() % remaining + range[0];
 
-		sctp_spin_lock(&sctp_port_alloc_lock);
 		do {
 			rover++;
 			if ((rover < range[0]) || (rover > range[1]))
@@ -5337,7 +5336,6 @@ static long sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 		next:
 			sctp_spin_unlock(&head->lock);
 		} while (--remaining > 0);
-		sctp_spin_unlock(&sctp_port_alloc_lock);
 
 		/* Exhausted local port range during search? */
 		ret = 1;

Re: [Devel] [PATCH 1/5] net: Modify all rtnetlink methods to only work in the initial namespace

2007-10-10 Thread Eric W. Biederman

"Denis V. Lunev" <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>> Before I can enable rtnetlink to work in all network namespaces
>> I need to be certain that something won't break.  So this
>> patch deliberately disables all of the rtnetlink methods in everything
>> except the initial network namespace.  After the methods have been
>> audited this extra check can be disabled.
>>
> [...]
>>  static int br_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
>>  {
>> +struct net *net = skb->sk->sk_net;
>>  struct net_device *dev;
>>  int idx;
>>  
>
> I've read some code today while grepping for 'init_net.loopback_dev' and
> found an issue that is interesting and non-trivial for me.
>
> Network namespace is extracted from the packet in two different ways in
> TCP. This is a socket for outgoing path and a device for incoming.
> Though, there are some places called uniformly both from incoming and
> outgoing path.
>
> Typical example is netfilters. They are called uniformly all around the
> code. The prototype is the following:
>
> static unsigned int reject6_target(struct sk_buff **pskb,
>const struct net_device *in,
>const struct net_device *out,
>unsigned int hooknum,
>const struct xt_target *target,
>const void *targinfo);
>
> So, we are bound to the following options:
> - perform additional non-uniform hacks around to place 'struct net' into
>   other and other structures like xt_target
> - add 7th parameter here and over

> - introduce an skb_net field in the 'struct sk_buff' making all code
>   uniform, at least when we have an skb

No.  That bloats a sk_buff, changes the semantics of moving a skb
around, and decreases performance (because we have to maintain the
field on a fast path).  

There will not be a skb_net field.

The entire concept of skb_net is a maintenance disaster.

> I think that this is not the last place with such a parameter list and
> we should make a decision at this point when the code is not mainline yet.

Certainly that is what I have a proof of concept tree for.  So we can
see how these things look before we merge them. 

> As far as I understand, netfilters are not touched by Eric and we
> can face some non-trivial problems there.

No.  In my proof of concept tree I should have working per network
namespace netfilter code.  My intention was to just do enough to see
what the impact would be so most of the netfilter code (in my tree)
insists on running in the initial network namespace.  But there are
a few pieces that are fully converted.  Please take a look.

> So, if my point about uniformity is valid, this patchset looks wrong and
> should be re-worked :(

This patchset does need to get rebased on top of net-2.6.25 when it
opens and hopefully your patchset to remove the unnecessary work in
rtnl_unlock, and to really process netlink requests in process
context.  I don't see a need for the more fundamental change you seem
to be advocating.

Differentiating between the incoming and the outgoing code paths is
something we already do for permission checking, for locking, for
sleeping, etc.  Modifying the code requires reading and understanding
it in context.  That is the nature of code.

This does make large patches going across the entire networking stack
making something a network namespace parameter difficult, but it
should not cause any problem for maintenance or other work on
the code.  As shown by the fact that even outside the tree rebasing my
network namespace patches has not been all that difficult.

So no I don't think uniformity, or beauty or elegance is what we
are after right now.  Trying too hard in that direction ultimately
obfuscates the code.

What we want is something that is simple, straightforward, and
doesn't require you to be an expert in network namespaces to
understand the code or the patches.

In the particular case of the netfilter hooks we don't have a
network namespace parameter laying around before we call NF_HOOK,
and the idiom "net = (in?in:out)->nd_net" seems perfectly accurate
so it seems reasonable to me to derive the network namespace that
way in generic code.  Although, thinking about this, we know which
hooks we are being called from, so we may in fact already know
which of in or out must be valid when we get to the netfilter
hook.
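
As a rough illustration of the "(in?in:out)->nd_net" idiom, here is a self-contained sketch.  The two structures are stand-ins that carry only the field needed here, and hook_net() is a hypothetical name; the real kernel types and hook signatures are much larger.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for illustration only; the real kernel structures
 * carry many more fields. */
struct net { int id; };
struct net_device { struct net *nd_net; };

/* Derive the namespace from whichever device the hook was given.
 * At least one of in/out is expected to be non-NULL. */
static struct net *hook_net(const struct net_device *in,
			    const struct net_device *out)
{
	const struct net_device *dev = in ? in : out;

	assert(dev != NULL);
	return dev->nd_net;
}
```

On an incoming path only "in" would be set, on an outgoing path only "out"; either way the caller ends up with the same kind of answer without needing an extra parameter.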

Eric

Re: [PATCH][BNX2X] round three

2007-10-10 Thread Eliezer Tamir


Eliezer Tamir wrote:

The full patch is available at:
[EMAIL PROTECTED]://ftp1.broadcom.com/0001-BNX2X-0.40.10a-net-2.6.24.patch


Just when I thought I had beaten the line beast.
(or maybe it's just too much work and not enough sleep.)

the right links are of course:
ftp://[EMAIL PROTECTED]/0001-BNX2X-0.40.10a-net-2.6.24.patch

and
ftp://[EMAIL PROTECTED]/0001-BNX2X-0.40.10a-net-2.6.24.patch.gz





Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-10 Thread Eric W. Biederman

Johannes Berg <[EMAIL PROTECTED]> writes:

> On Tue, 2007-10-09 at 11:41 -0600, Eric W. Biederman wrote:
>
>> So please hold off on this until the kernel has been audited and
>> we have removed all of the uses of ifindex that assume ifindex is
>> global, that we can find.
>
> I certainly have this assumption in the wireless code (cfg80211). How
> would I go about removing it? Are netlink sockets per-namespace so I can
> use the namespace of the netlink socket to look up a netdev?

Yes.  Netlink sockets are per-namespace and you can use the namespace
of a netlink socket to look up a netdev.
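
A small sketch of what that lookup could look like.  Everything here is a simplified stand-in: the structures hold only the fields needed for the example, and dev_by_index_for_sock() is a hypothetical helper rather than an existing kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; only the fields this example needs. */
struct net_device { int ifindex; struct net_device *next; };
struct net { struct net_device *dev_list; };
struct sock { struct net *sk_net; };

/* Look up a device by ifindex, but only among the devices of the
 * namespace that the (netlink) socket belongs to. */
static struct net_device *dev_by_index_for_sock(const struct sock *sk,
						 int ifindex)
{
	struct net_device *dev;

	assert(sk != NULL && sk->sk_net != NULL);
	for (dev = sk->sk_net->dev_list; dev; dev = dev->next)
		if (dev->ifindex == ifindex)
			return dev;
	return NULL;
}
```

If ifindex values become per-namespace, the same number can name different devices in different namespaces, which is why the socket's namespace has to be part of the lookup.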

Eric

Re: authenc compile warnings in current net-2.6.24

2007-10-10 Thread Sebastian Siewior

* Oliver Hartkopp | 2007-10-10 19:53:53 [+0200]:

> CC [M] crypto/authenc.o
> crypto/authenc.c: In function ‘crypto_authenc_hash’:
> crypto/authenc.c:88: warning: ‘cryptlen’ may be used uninitialized in this function
> crypto/authenc.c:87: warning: ‘dst’ may be used uninitialized in this function
> crypto/authenc.c: In function ‘crypto_authenc_decrypt’:
> crypto/authenc.c:163: warning: ‘cryptlen’ may be used uninitialized in this function
> crypto/authenc.c:163: note: ‘cryptlen’ was declared here
> crypto/authenc.c:162: warning: ‘src’ may be used uninitialized in this function
> crypto/authenc.c:162: note: ‘src’ was declared here
>
> do you already know these warnings?

Those warnings look like a compiler bug to me.

> Oliver

Sebastian

Re: [IPv6] Update setsockopt(IPV6_MULTICAST_IF) to support RFC 3493

2007-10-10 Thread David Stevens

What about just checking for 0 in the later test?

if (val && __dev_get_by_index(val) == NULL) {
...


+-DLS


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-10-10 Thread Sean Hefty

The hack to use a socket and bind it to claim the port was just for 
demonstrating the idea.  The correct solution, IMO, is to enhance the 
core low level 4-tuple allocation services to be more generic (eg: not 
be tied to a struct sock).  Then the host tcp stack and the host rdma 
stack can allocate TCP/iWARP ports/4tuples from this common exported 
service and share the port space.  This allocation service could also be 
used by other deep adapters like iscsi adapters if needed.


Since iWarp runs on top of TCP, the port space is really the same. 
FWIW, I agree that this proposal is the correct solution to support iWarp.


- Sean

Re: [IPv6] Update setsockopt(IPV6_MULTICAST_IF) to support RFC 3493

2007-10-10 Thread Brian Haley


David Stevens wrote:

What about just checking for 0 in the later test?

if (val && __dev_get_by_index(val) == NULL) {


We could fail the next check right before that though:

  if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != val)
  goto e_inval;

I just mimicked what the IPv4 code does in do_ip_setsockopt().

-Brian

Re: [IPv6] Update setsockopt(IPV6_MULTICAST_IF) to support RFC 3493

2007-10-10 Thread David Stevens

Brian Haley <[EMAIL PROTECTED]> wrote on 10/10/2007 02:20:45 PM:

> David Stevens wrote:
> > What about just checking for 0 in the later test?
> > 
> > if (val && __dev_get_by_index(val) == NULL) {
> 
> We could fail the next check right before that though:

Right, the semantics there would be "if we have a bound
dev if, that's the only legal value here." Setting it to '0' in
that case doesn't really do anything, anyway. But I don't care
about that semantic difference; we could even add "val &&" to the
bound_dev_if check.
What I don't like is that your "if" creates an identical
duplicate code path for the functional part of it. In this case
it's trivial (the assignment), but it makes the code look more
complex than it really is. If v4 does it that way, I don't
like that either. :-)
I agree with it in general, and it may not be worth the
trouble, but I'd personally prefer something like:

	if (sk->sk_type == SOCK_STREAM)
		goto e_inval;
	if (val && sk->sk_bound_dev_if && sk->sk_bound_dev_if != val)
		goto e_inval;

	if (val && __dev_get_by_index(val) == NULL) {
		retv = -ENODEV;
		break;
	}
[at this point all validity checks are done and we're following
one code path to do the work; each check is easily
identifiable.]

	np->mcast_oif = val;
	retv = 0;
	break;

Or maybe:

	if (sk->sk_type == SOCK_STREAM)
		goto e_inval;

	if (val) {
		if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != val)
			goto e_inval;
		if (__dev_get_by_index(val) == NULL) {
			retv = -ENODEV;
			break;
		}
	}
	np->mcast_oif = val;
	retv = 0;
	break;

But anyway, I made the comment; I think some form of it
should go in. :-) If you like the original better, that's
ok with me, too.
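
For reference, the check ordering discussed above can be exercised outside the kernel with a small sketch.  This is only an illustration under stated assumptions: check_mcast_if() and dev_exists() are hypothetical names, the table of known interface indexes stands in for __dev_get_by_index(), and the "goto e_inval" path is modeled as returning -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for __dev_get_by_index(): returns 1 if the
 * interface index is in the caller-supplied table. */
static int dev_exists(const int *known, size_t n, int ifindex)
{
	size_t i;

	assert(known != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (known[i] == ifindex)
			return 1;
	return 0;
}

/* Returns 0 if val may be stored as the multicast interface,
 * -EINVAL or -ENODEV otherwise.  A val of 0 clears the setting and
 * skips both device checks. */
static int check_mcast_if(int is_stream, int bound_dev_if, int val,
			  const int *known, size_t n)
{
	if (is_stream)
		return -EINVAL;
	if (val) {
		if (bound_dev_if && bound_dev_if != val)
			return -EINVAL;
		if (!dev_exists(known, n, val))
			return -ENODEV;
	}
	return 0;
}
```

All validity checks come first and there is a single success path, which is the property argued for above.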

+-DLS


[PATCH 0/2] QE clock source improvements

2007-10-10 Thread Timur Tabi


This patch set adds a new property to make specifying QE clock sources
easier, adds a function to help parse the property, updates some other
functions to use an enum instead of an integer, and updates the ucc_geth
driver to take advantage of all this.


[PATCH 1/2] qe: add function qe_clock_source

2007-10-10 Thread Timur Tabi

Add function qe_clock_source() which takes a string containing the name of a
QE clock source (as is typically found in device trees) and returns the
matching enum qe_clock value.

Update booting-without-of.txt to indicate that the UCC properties rx-clock
and tx-clock are deprecated and replaced with rx-clock-name and tx-clock-name,
which use strings instead of numbers to indicate QE clock sources.

Update qe_setbrg() to take an enum qe_clock instead of an integer as its
first parameter.

Signed-off-by: Timur Tabi <[EMAIL PROTECTED]>
---

This patch applies to Kumar's for-2.6.24 branch.

 arch/powerpc/sysdev/qe_lib/qe.c |   13 +++--
 include/asm-powerpc/qe.h|   98 +++
 2 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/sysdev/qe_lib/qe.c b/arch/powerpc/sysdev/qe_lib/qe.c
index 3ccd360..8551e74 100644
--- a/arch/powerpc/sysdev/qe_lib/qe.c
+++ b/arch/powerpc/sysdev/qe_lib/qe.c
@@ -167,7 +167,7 @@ unsigned int get_brg_clk(void)
 
 /* Program the BRG to the given sampling rate and multiplier
  *
- * @brg: the BRG, 1-16
+ * @brg: the BRG, QE_BRG1 - QE_BRG16
  * @rate: the desired sampling rate
  * @multiplier: corresponds to the value programmed in GUMR_L[RDCR] or
  * GUMR_L[TDCR].  E.g., if this BRG is the RX clock, and GUMR_L[RDCR]=01,
@@ -175,11 +175,14 @@ unsigned int get_brg_clk(void)
  *
  * Also note that the value programmed into the BRGC register must be even.
  */
-void qe_setbrg(unsigned int brg, unsigned int rate, unsigned int multiplier)
+void qe_setbrg(enum qe_clock brg, unsigned int rate, unsigned int multiplier)
 {
u32 divisor, tempval;
u32 div16 = 0;
 
+   if ((brg < QE_BRG1) || (brg > QE_BRG16))
+   return;
+
divisor = get_brg_clk() / (rate * multiplier);
 
if (divisor > QE_BRGC_DIVISOR_MAX + 1) {
@@ -196,7 +199,7 @@ void qe_setbrg(unsigned int brg, unsigned int rate, 
unsigned int multiplier)
tempval = ((divisor - 1) << QE_BRGC_DIVISOR_SHIFT) |
QE_BRGC_ENABLE | div16;
 
-   out_be32(&qe_immr->brg.brgc[brg - 1], tempval);
+   out_be32(&qe_immr->brg.brgc[brg - QE_BRG1], tempval);
 }
 
 /* Convert a string to a QE clock source enum
@@ -214,7 +217,7 @@ enum qe_clock qe_clock_source(const char *source)
if (strncasecmp(source, "brg", 3) == 0) {
i = simple_strtoul(source + 3, NULL, 10);
if ((i >= 1) && (i <= 16))
-   return QE_BRG1 + i - 1;
+   return (QE_BRG1 - 1) + i;
else
return QE_CLK_DUMMY;
}
@@ -222,7 +225,7 @@ enum qe_clock qe_clock_source(const char *source)
if (strncasecmp(source, "clk", 3) == 0) {
i = simple_strtoul(source + 3, NULL, 10);
if ((i >= 1) && (i <= 24))
-   return QE_CLK1 + i - 1;
+   return (QE_CLK1 - 1) + i;
else
return QE_CLK_DUMMY;
}
diff --git a/include/asm-powerpc/qe.h b/include/asm-powerpc/qe.h
index 7d53750..81403ee 100644
--- a/include/asm-powerpc/qe.h
+++ b/include/asm-powerpc/qe.h
@@ -28,6 +28,52 @@
 #define MEM_PART_SECONDARY 1
 #define MEM_PART_MURAM 2
 
+/* Clocks and BRGs */
+enum qe_clock {
+   QE_CLK_NONE = 0,
+   QE_BRG1,/* Baud Rate Generator 1 */
+   QE_BRG2,/* Baud Rate Generator 2 */
+   QE_BRG3,/* Baud Rate Generator 3 */
+   QE_BRG4,/* Baud Rate Generator 4 */
+   QE_BRG5,/* Baud Rate Generator 5 */
+   QE_BRG6,/* Baud Rate Generator 6 */
+   QE_BRG7,/* Baud Rate Generator 7 */
+   QE_BRG8,/* Baud Rate Generator 8 */
+   QE_BRG9,/* Baud Rate Generator 9 */
+   QE_BRG10,   /* Baud Rate Generator 10 */
+   QE_BRG11,   /* Baud Rate Generator 11 */
+   QE_BRG12,   /* Baud Rate Generator 12 */
+   QE_BRG13,   /* Baud Rate Generator 13 */
+   QE_BRG14,   /* Baud Rate Generator 14 */
+   QE_BRG15,   /* Baud Rate Generator 15 */
+   QE_BRG16,   /* Baud Rate Generator 16 */
+   QE_CLK1,/* Clock 1 */
+   QE_CLK2,/* Clock 2 */
+   QE_CLK3,/* Clock 3 */
+   QE_CLK4,/* Clock 4 */
+   QE_CLK5,/* Clock 5 */
+   QE_CLK6,/* Clock 6 */
+   QE_CLK7,/* Clock 7 */
+   QE_CLK8,/* Clock 8 */
+   QE_CLK9,/* Clock 9 */
+   QE_CLK10,   /* Clock 10 */
+   QE_CLK11,   /* Clock 11 */
+   QE_CLK12,   /* Clock 12 */
+   QE_CLK13,   /* Clock 13 */
+   QE_CLK14,   /* Clock 14 */
+   QE_CLK15,   /* Clock 15 */
+

[PATCH 2/2] ucc_geth: use rx-clock-name and tx-clock-name device tree properties

2007-10-10 Thread Timur Tabi

This patch updates the ucc_geth device driver to check the new rx-clock-name
and tx-clock-name properties first.  If present, it uses the new function
qe_clock_source() to obtain the clock source.  Otherwise, it checks the
deprecated rx-clock and tx-clock properties.

The device trees for 832x, 836x, and 8568 have been updated to contain the
new property names only.

Signed-off-by: Timur Tabi <[EMAIL PROTECTED]>
---

This patch applies to Kumar's for-2.6.24 branch, on top of my other patch titled
"qe: add function qe_clock_source".

 arch/powerpc/boot/dts/mpc832x_mds.dts |8 
 arch/powerpc/boot/dts/mpc832x_rdb.dts |8 
 arch/powerpc/boot/dts/mpc836x_mds.dts |8 
 arch/powerpc/boot/dts/mpc8568mds.dts  |8 
 drivers/net/ucc_geth.c|   12 +---
 5 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/boot/dts/mpc832x_mds.dts 
b/arch/powerpc/boot/dts/mpc832x_mds.dts
index fcd333c..b57485b 100644
--- a/arch/powerpc/boot/dts/mpc832x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc832x_mds.dts
@@ -217,8 +217,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <19>;
-   tx-clock = <1a>;
+   rx-clock-name = "clk9";
+   tx-clock-name = "clk10";
phy-handle = < &phy3 >;
pio-handle = < &pio3 >;
};
@@ -238,8 +238,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <17>;
-   tx-clock = <18>;
+   rx-clock-name = "clk7";
+   tx-clock-name = "clk8";
phy-handle = < &phy4 >;
pio-handle = < &pio4 >;
};
diff --git a/arch/powerpc/boot/dts/mpc832x_rdb.dts 
b/arch/powerpc/boot/dts/mpc832x_rdb.dts
index 388c8a7..e68a08b 100644
--- a/arch/powerpc/boot/dts/mpc832x_rdb.dts
+++ b/arch/powerpc/boot/dts/mpc832x_rdb.dts
@@ -202,8 +202,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <20>;
-   tx-clock = <13>;
+   rx-clock-name = "clk16";
+   tx-clock-name = "clk3";
phy-handle = <&phy00>;
pio-handle = <&ucc2pio>;
};
@@ -223,8 +223,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <19>;
-   tx-clock = <1a>;
+   rx-clock-name = "clk9";
+   tx-clock-name = "clk10";
phy-handle = <&phy04>;
pio-handle = <&ucc3pio>;
};
diff --git a/arch/powerpc/boot/dts/mpc836x_mds.dts 
b/arch/powerpc/boot/dts/mpc836x_mds.dts
index fbd1573..7a54072 100644
--- a/arch/powerpc/boot/dts/mpc836x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc836x_mds.dts
@@ -245,8 +245,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <0>;
-   tx-clock = <19>;
+   rx-clock-name = "none";
+   tx-clock-name = "clk9";
phy-handle = < &phy0 >;
phy-connection-type = "rgmii-id";
pio-handle = < &pio1 >;
@@ -267,8 +267,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <0>;
-   tx-clock = <14>;
+   rx-clock-name = "none";
+   tx-clock-name = "clk4";
phy-handle = < &phy1 >;
phy-connection-type = "rgmii-id";
pio-handle = < &pio2 >;
diff --git a/arch/powerpc/boot/dts/mpc8568mds.dts 
b/arch/powerpc/boot/dts/mpc8568mds.dts
index 5439437..cf45aab 100644
--- a/arch/powerpc/boot/dts/mpc8568mds.dts
+++ b/arch/powerpc/boot/dts/mpc8568mds.dts
@@ -333,8 +333,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <0>;
-   tx-clock = <20>;
+   rx-clock-name = "none";
+   tx-clock-name = "clk16";
pio-handle = <&pio1>;

Re: [PATCH 0/2] QE clock source improvements

2007-10-10 Thread Timur Tabi

Sorry, please ignore this set.  Something got screwed up with the patches. 
I'm going to resend.


Timur Tabi wrote:

This patch set adds a new property to make specifying QE clock sources
easier, adds a function to help parse the property, updates some other
functions to use an enum instead of an integer, and updates the ucc_geth
driver to take advantage of all this.

___
Linuxppc-dev mailing list
[EMAIL PROTECTED]
https://ozlabs.org/mailman/listinfo/linuxppc-dev




--
Timur Tabi
Linux Kernel Developer @ Freescale

Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller

From: jamal <[EMAIL PROTECTED]>
Date: Wed, 10 Oct 2007 09:08:48 -0400

> On Wed, 2007-10-10 at 03:44 -0700, David Miller wrote:
> 
> > I've always gotten very poor results when increasing the TX queue a
> > lot, for example with NIU the point of diminishing returns seems to
> > be in the range of 256-512 TX descriptor entries and this was with
> > 1.6Ghz cpus.
> 
> Is it interrupt per packet? From my experience, you may find interesting
> results varying tx interrupt mitigation parameters in addition to the
> ring parameters.
> Unfortunately when you do that, optimal parameters also depend on
> packet size, so what may work for 64B won't work well for 1400B.

No, it was not interrupt per-packet, I was telling the chip to
interrupt me every 1/4 of the ring.
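
One way to picture "interrupt every 1/4 of the ring" is the sketch below.  It is not taken from the NIU driver or any other driver; want_tx_irq() is a hypothetical helper, and it assumes the ring size is a multiple of four.

```c
#include <assert.h>

/* Hypothetical helper, not taken from any driver: decide whether the
 * descriptor at position idx should request a TX-completion interrupt
 * so that the chip interrupts once per quarter of the ring. */
static int want_tx_irq(unsigned int idx, unsigned int ring_size)
{
	unsigned int quarter = ring_size / 4;

	assert(ring_size >= 4 && ring_size % 4 == 0 && idx < ring_size);
	return ((idx + 1) % quarter) == 0;
}
```

For a 256-entry ring this marks descriptors 63, 127, 191 and 255, so four interrupts are requested per pass over the ring regardless of packet size.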

[PATCH 0/2] QE clock source improvements

2007-10-10 Thread Timur Tabi


(Replaces all previous versions of this patch)

This patch set adds a new property to make specifying QE clock sources
easier, adds a function to help parse the property, updates some other
functions to use an enum instead of an integer, and updates the ucc_geth
driver to take advantage of all this.


[PATCH 1/2] qe: add function qe_clock_source

2007-10-10 Thread Timur Tabi

Add function qe_clock_source() which takes a string containing the name of a
QE clock source (as is typically found in device trees) and returns the
matching enum qe_clock value.
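
The string-to-enum mapping can be tried in user space with the sketch below.  It mirrors the structure of qe_clock_source() but is only an approximation: parse_clock() and enum clk_sketch are hypothetical names, only the range boundaries of the enum are spelled out, and the position of the "dummy" value right after CLK24 is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <strings.h>

/* Condensed stand-in for enum qe_clock: only the range boundaries
 * are named.  The numeric layout (NONE, BRG1..16, CLK1..24, DUMMY)
 * is assumed from the order of the enum in the patch. */
enum clk_sketch {
	CLK_SK_NONE = 0,
	CLK_SK_BRG1 = 1,	/* BRG1..BRG16 are 1..16 */
	CLK_SK_CLK1 = 17,	/* CLK1..CLK24 are 17..40 */
	CLK_SK_DUMMY = 41	/* returned for anything unrecognised */
};

/* Convert "none", "brgN" (N = 1..16) or "clkN" (N = 1..24) to a
 * value of enum clk_sketch; the match is case-insensitive. */
static int parse_clock(const char *source)
{
	unsigned long i;

	assert(source != NULL);

	if (strcasecmp(source, "none") == 0)
		return CLK_SK_NONE;

	if (strncasecmp(source, "brg", 3) == 0) {
		i = strtoul(source + 3, NULL, 10);
		if (i >= 1 && i <= 16)
			return (int)(CLK_SK_BRG1 - 1 + i);
		return CLK_SK_DUMMY;
	}

	if (strncasecmp(source, "clk", 3) == 0) {
		i = strtoul(source + 3, NULL, 10);
		if (i >= 1 && i <= 24)
			return (int)(CLK_SK_CLK1 - 1 + i);
		return CLK_SK_DUMMY;
	}

	return CLK_SK_DUMMY;
}
```

Under this layout "clk9" maps to 25, which is 0x19, the same number the deprecated rx-clock property used for that source.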

Update booting-without-of.txt to indicate that the UCC properties rx-clock
and tx-clock are deprecated and replaced with rx-clock-name and tx-clock-name,
which use strings instead of numbers to indicate QE clock sources.

Update qe_setbrg() to take an enum qe_clock instead of an integer as its
first parameter.

Signed-off-by: Timur Tabi <[EMAIL PROTECTED]>
---

This patch applies to Kumar's for-2.6.24 branch.

 Documentation/powerpc/booting-without-of.txt |   13 
 arch/powerpc/sysdev/qe_lib/qe.c  |   41 ++-
 include/asm-powerpc/qe.h |   95 +-
 3 files changed, 99 insertions(+), 50 deletions(-)

diff --git a/Documentation/powerpc/booting-without-of.txt 
b/Documentation/powerpc/booting-without-of.txt
index 7a6c5f2..d8306ee 100644
--- a/Documentation/powerpc/booting-without-of.txt
+++ b/Documentation/powerpc/booting-without-of.txt
@@ -1615,6 +1615,19 @@ platforms are moved over to use the 
flattened-device-tree model.
- interrupt-parent : the phandle for the interrupt controller that
  services interrupts for this device.
- pio-handle : The phandle for the Parallel I/O port configuration.
+   - rx-clock-name: the UCC receive clock source
+ "none": clock source is disabled
+ "brg1" through "brg16": clock source is BRG1-BRG16, respectively
+ "clk1" through "clk24": clock source is CLK1-CLK24, respectively
+   - tx-clock-name: the UCC transmit clock source
+ "none": clock source is disabled
+ "brg1" through "brg16": clock source is BRG1-BRG16, respectively
+ "clk1" through "clk24": clock source is CLK1-CLK24, respectively
+   The following two properties are deprecated.  rx-clock has been replaced
+   with rx-clock-name, and tx-clock has been replaced with tx-clock-name.
+   Drivers that currently use the deprecated properties should continue to
+   do so, in order to support older device trees, but they should be updated
+   to check for the new properties first.
- rx-clock : represents the UCC receive clock source.
  0x00 : clock source is disabled;
  0x1~0x10 : clock source is BRG1~BRG16 respectively;
diff --git a/arch/powerpc/sysdev/qe_lib/qe.c b/arch/powerpc/sysdev/qe_lib/qe.c
index 3d57d38..8551e74 100644
--- a/arch/powerpc/sysdev/qe_lib/qe.c
+++ b/arch/powerpc/sysdev/qe_lib/qe.c
@@ -167,7 +167,7 @@ unsigned int get_brg_clk(void)
 
 /* Program the BRG to the given sampling rate and multiplier
  *
- * @brg: the BRG, 1-16
+ * @brg: the BRG, QE_BRG1 - QE_BRG16
  * @rate: the desired sampling rate
  * @multiplier: corresponds to the value programmed in GUMR_L[RDCR] or
  * GUMR_L[TDCR].  E.g., if this BRG is the RX clock, and GUMR_L[RDCR]=01,
@@ -175,11 +175,14 @@ unsigned int get_brg_clk(void)
  *
  * Also note that the value programmed into the BRGC register must be even.
  */
-void qe_setbrg(unsigned int brg, unsigned int rate, unsigned int multiplier)
+void qe_setbrg(enum qe_clock brg, unsigned int rate, unsigned int multiplier)
 {
u32 divisor, tempval;
u32 div16 = 0;
 
+   if ((brg < QE_BRG1) || (brg > QE_BRG16))
+   return;
+
divisor = get_brg_clk() / (rate * multiplier);
 
if (divisor > QE_BRGC_DIVISOR_MAX + 1) {
@@ -196,8 +199,40 @@ void qe_setbrg(unsigned int brg, unsigned int rate, 
unsigned int multiplier)
tempval = ((divisor - 1) << QE_BRGC_DIVISOR_SHIFT) |
QE_BRGC_ENABLE | div16;
 
-   out_be32(&qe_immr->brg.brgc[brg - 1], tempval);
+   out_be32(&qe_immr->brg.brgc[brg - QE_BRG1], tempval);
+}
+
+/* Convert a string to a QE clock source enum
+ *
+ * This function takes a string, typically from a property in the device
+ * tree, and returns the corresponding "enum qe_clock" value.
+*/
+enum qe_clock qe_clock_source(const char *source)
+{
+   unsigned int i;
+
+   if (strcasecmp(source, "none") == 0)
+   return QE_CLK_NONE;
+
+   if (strncasecmp(source, "brg", 3) == 0) {
+   i = simple_strtoul(source + 3, NULL, 10);
+   if ((i >= 1) && (i <= 16))
+   return (QE_BRG1 - 1) + i;
+   else
+   return QE_CLK_DUMMY;
+   }
+
+   if (strncasecmp(source, "clk", 3) == 0) {
+   i = simple_strtoul(source + 3, NULL, 10);
+   if ((i >= 1) && (i <= 24))
+   return (QE_CLK1 - 1) + i;
+   else
+   return QE_CLK_DUMMY;
+   }
+
+   return QE_CLK_DUMMY;
 }
+EXPORT_SYMBOL(qe_clock_source);
 
 /* Initialize SNUMs (thread serial numbers) according to
  * QE Module Control chapter, SNUM table
diff --git a/include/asm-powerpc/qe.h b/include/asm-powerpc/qe.h
index 0dabe46..81403ee 100644
--- a/include/asm-powerpc/qe.h
+++ b/include/asm-powerpc/qe.h
@@ -28,6 +28,52 @@
 #de

[PATCH 2/2] ucc_geth: use rx-clock-name and tx-clock-name device tree properties

2007-10-10 Thread Timur Tabi

This patch updates the ucc_geth device driver to check the new rx-clock-name
and tx-clock-name properties first.  If present, it uses the new function
qe_clock_source() to obtain the clock source.  Otherwise, it checks the
deprecated rx-clock and tx-clock properties.

The device trees for 832x, 836x, and 8568 have been updated to contain the
new property names only.
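
The relation between the deprecated numeric values and the new names can be sketched as follows.  The mapping is inferred from the device tree hunks in this patch (for example <19> becoming "clk9" and <20> becoming "clk16", with the old values read as hexadecimal) and from the documented BRG range; legacy_clock_name() is a hypothetical helper, not part of the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: translate a deprecated numeric rx-clock or
 * tx-clock value into the string form.  Returns 0 on success, -1 if
 * the value is out of range or the buffer is too small. */
static int legacy_clock_name(unsigned int val, char *buf, size_t len)
{
	int n;

	assert(buf != NULL && len > 0);

	if (val == 0)
		n = snprintf(buf, len, "none");
	else if (val <= 0x10)		/* 0x01..0x10: BRG1..BRG16 */
		n = snprintf(buf, len, "brg%u", val);
	else if (val <= 0x28)		/* 0x11..0x28: CLK1..CLK24 */
		n = snprintf(buf, len, "clk%u", val - 0x10);
	else
		return -1;

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

A driver that still has to accept old device trees would go the other way and use the numeric value directly when the name property is absent, as the description above says.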

Signed-off-by: Timur Tabi <[EMAIL PROTECTED]>
---

This patch applies to Kumar's for-2.6.24 branch, on top of my other patch titled
"qe: add function qe_clock_source".

 arch/powerpc/boot/dts/mpc832x_mds.dts |8 ++--
 arch/powerpc/boot/dts/mpc832x_rdb.dts |8 ++--
 arch/powerpc/boot/dts/mpc836x_mds.dts |8 ++--
 arch/powerpc/boot/dts/mpc8568mds.dts  |8 ++--
 drivers/net/ucc_geth.c|   55 ++--
 5 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/boot/dts/mpc832x_mds.dts 
b/arch/powerpc/boot/dts/mpc832x_mds.dts
index fcd333c..b57485b 100644
--- a/arch/powerpc/boot/dts/mpc832x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc832x_mds.dts
@@ -217,8 +217,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <19>;
-   tx-clock = <1a>;
+   rx-clock-name = "clk9";
+   tx-clock-name = "clk10";
phy-handle = < &phy3 >;
pio-handle = < &pio3 >;
};
@@ -238,8 +238,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <17>;
-   tx-clock = <18>;
+   rx-clock-name = "clk7";
+   tx-clock-name = "clk8";
phy-handle = < &phy4 >;
pio-handle = < &pio4 >;
};
diff --git a/arch/powerpc/boot/dts/mpc832x_rdb.dts b/arch/powerpc/boot/dts/mpc832x_rdb.dts
index 388c8a7..e68a08b 100644
--- a/arch/powerpc/boot/dts/mpc832x_rdb.dts
+++ b/arch/powerpc/boot/dts/mpc832x_rdb.dts
@@ -202,8 +202,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <20>;
-   tx-clock = <13>;
+   rx-clock-name = "clk16";
+   tx-clock-name = "clk3";
phy-handle = <&phy00>;
pio-handle = <&ucc2pio>;
};
@@ -223,8 +223,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <19>;
-   tx-clock = <1a>;
+   rx-clock-name = "clk9";
+   tx-clock-name = "clk10";
phy-handle = <&phy04>;
pio-handle = <&ucc3pio>;
};
diff --git a/arch/powerpc/boot/dts/mpc836x_mds.dts b/arch/powerpc/boot/dts/mpc836x_mds.dts
index fbd1573..7a54072 100644
--- a/arch/powerpc/boot/dts/mpc836x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc836x_mds.dts
@@ -245,8 +245,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <0>;
-   tx-clock = <19>;
+   rx-clock-name = "none";
+   tx-clock-name = "clk9";
phy-handle = < &phy0 >;
phy-connection-type = "rgmii-id";
pio-handle = < &pio1 >;
@@ -267,8 +267,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <0>;
-   tx-clock = <14>;
+   rx-clock-name = "none";
+   tx-clock-name = "clk4";
phy-handle = < &phy1 >;
phy-connection-type = "rgmii-id";
pio-handle = < &pio2 >;
diff --git a/arch/powerpc/boot/dts/mpc8568mds.dts b/arch/powerpc/boot/dts/mpc8568mds.dts
index 5439437..cf45aab 100644
--- a/arch/powerpc/boot/dts/mpc8568mds.dts
+++ b/arch/powerpc/boot/dts/mpc8568mds.dts
@@ -333,8 +333,8 @@
 */
mac-address = [ 00 00 00 00 00 00 ];
local-mac-address = [ 00 00 00 00 00 00 ];
-   rx-clock = <0>;
-   tx-clock = <20>;
+   rx-clock-name = "none";
+   tx-clock-name = "clk16";
pio-handle = <&pi
