Re: kern/152036: [libc] getifaddrs(3) returns truncated sockaddrs for netmasks

2011-07-12 Thread Kelly Yancey
The following reply was made to PR kern/152036; it has been noted by GNATS.

From: Kelly Yancey 
To: bug-follo...@freebsd.org
Cc:  
Subject: Re: kern/152036: [libc] getifaddrs(3) returns truncated sockaddrs for
 netmasks
Date: Tue, 12 Jul 2011 19:24:22 -0700

 Thanks, now let's just change the category to kern to reflect...oh, wait.
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


RE: 1 IP - 1 Firewall - 2 Webservers

2001-12-11 Thread Kelly Yancey

On Wed, 12 Dec 2001, Tom Peck wrote:

> Hi Julian
> 
> Yes, we currently have Squid serving this purpose - but as I stated in my 
> first email, ALL incoming Client IP's and Addresses are always that of the 
> GATEWAY_BOX - so for website security and logs, this isn't the best 
> option..  I have yet to try Apache, but I have heard it acts in the same 
> way - can someone clarify this?
> 
> Thanks
> 
> Tom
> 

  I have to apologize, I deleted the original post, but as I recall you have
the actual forwarding working dandy. The only concern, which everyone has
failed to address, is that you want the NAT'ed web servers to know the
originating IP address for logging and IP-based security. The reason you
don't have this now is that the originating request is intercepted by squid
on your gateway machine, which then issues a request to one of the internal
web servers using its "inside" IP address on the originator's behalf. Your
web server only ever sees the proxy's IP address.
  The question, then, is how to communicate the originator's IP address to the
web server. I haven't answered previously because I'm no squid expert, but
here is the solution that comes to mind:

  You could hack squid (assuming it doesn't have a knob to do it already) to
include the originating IP address as an HTTP header in the proxied
request. Then, modify your apps on the web server to fetch the IP address from
this header (e.g. via an environment variable) as opposed to using the value
the web server populates REMOTE_HOST with. However, the IP address in the web
server logs will still be that of the proxy unless you teach the web server to
extract the IP from the new header.
  Of course, if you have the source to your web server (e.g. apache) then you
could also teach it to populate REMOTE_HOST with the IP address obtained from
the squid-supplied header and have it be transparent to your apps.
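
  As a rough illustration of the app-side change, here is a minimal C sketch
for a CGI program. It assumes the proxy passes the address in an
X-Forwarded-For header, which CGI exposes as the HTTP_X_FORWARDED_FOR
environment variable, and it uses REMOTE_ADDR (the numeric counterpart of
REMOTE_HOST) as the connection's peer address. The function names and the
trusted_proxy parameter are hypothetical, and the header value is returned
as-is on the assumption that the proxy sets it to a single address.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Pick the address to treat as the client.  peer is the address of the
 * TCP connection, fwd is the forwarded-for header value (may be NULL),
 * and trusted_proxy is the inside address of our own proxy.
 */
const char *
choose_client(const char *peer, const char *fwd, const char *trusted_proxy)
{
	assert(trusted_proxy != NULL);

	/* Only believe the header when the request came from our proxy. */
	if (peer != NULL && fwd != NULL && *fwd != '\0' &&
	    strcmp(peer, trusted_proxy) == 0)
		return (fwd);
	return (peer);
}

/* CGI wrapper: read both values from the environment. */
const char *
cgi_client_address(const char *trusted_proxy)
{
	return (choose_client(getenv("REMOTE_ADDR"),
	    getenv("HTTP_X_FORWARDED_FOR"), trusted_proxy));
}
```

  An app would call cgi_client_address() with the proxy's inside address
wherever it currently reads the remote address directly.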

  All that said, you would have to take extra precautions in squid to not
allow remote clients to supply the header themselves (i.e. to replace the
header if it exists and add it if it doesn't), but this should be pretty
straightforward.

  I hope that answers your question (assuming I am remembering it correctly
:) ). Good luck!

  Kelly

--
Kelly Yancey  -  kbyanc@{posi.net,FreeBSD.org}


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-net" in the body of the message



RE: 1 IP - 1 Firewall - 2 Webservers

2001-12-11 Thread Kelly Yancey


  A quick search of google revealed that there is an apache module for this
specific purpose: http://web.systhug.com/mod_extract_forwarded/. So, if you
are using apache, this appears to do everything you need on the web-server
side. You might want to also look at the squid FAQ:
http://www.squid-cache.org/Doc/FAQ/FAQ-4.html#ss4.17

  Kelly

--
Kelly Yancey  -  kbyanc@{posi.net,FreeBSD.org}







RE: 1 IP - 1 Firewall - 2 Webservers

2001-12-11 Thread Kelly Yancey

On Wed, 12 Dec 2001, Tom Peck wrote:

> YES! That's exactly the problem!  Your memory is obviously far superior to 
> most :-).
> 

  That's a scary proposition indeed! :)

> >   Of course, if you have the source to your web server (i.e. apache) then you
> >could teach it to populate REMOTE_HOST with the IP address obtained from the
> >squid-supplied header also and have it be transparent to your apps.
> 
> And if we don't :-(  One of the servers has a pre-complied OS which cannot 
> be altered in this way. Surely there must be a simpler way!!
> 

  Ack. Alright, see my previous post regarding squid; that part of the
configuration should be simple. With squid supplying the information in the
HTTP headers, the only matter left is getting the web server or web
application to extract that information and to use that for the REMOTE_HOST
IP. Perhaps you could share with us what web server you are using (again, I
apologize if you included that in your original message).
  Also, do you just need a custom app to pick up the originating client's IP
address or do you also need it to be logged or used in web server-supplied
IP-based security? The former would be simple to solve, in that the app would
just have to be modified to obtain the client's IP from the header. For
logging, many web servers allow you to customize the log format to include the
value associated with a given HTTP header (so you could log the
X-Forwarded-For header). If you need it for web server-supplied IP-based
security, you're probably out of luck. In this case, the web server would have
to supply a knob to enable this behaviour.

> Thanks for the time taken in responding to my problem.  Unfortunately we 
> are not prepared to go to these lengths to get the thing working how we 
> would like it..  I'm quite surprised there isn't something available to 
> make this feasible.
> 

  There is the capability in the open-source tools :): squid supplies the
information and apache, by means of the mod_extract_forwarded module, can
extract it so everything is transparent. The only issue at hand is whether
your closed-source web server can be made to extract the information. :|

  Just to recap:
  The issue is that normally web servers obtain the client's IP address from
the source IP of the HTTP connection. However, in your setup, the proxy
receives the request (and therefore knows the client's IP), but it then
reissues the request using its inside IP to your NAT'ed web server. The web
server only ever receives these proxied requests, therefore the web server
always gets the same source IP on all of its incoming HTTP connections: that
of the proxy.
  Because of this, you need some way to communicate the client IP information
from the proxy to the web server, and a way to configure the web server to
switch from obtaining the IP from the HTTP connection and instead obtain it
from the proxy-supplied data. The first half of the puzzle is solved; squid
can pass the client IP via an HTTP header (X-Forwarded-For). All you need is a
solution for the latter half of the puzzle.

  All that said, I don't suspect too many commercial web servers are going to
supply such a knob due to the potential security issues. Forging an
X-Forwarded-For header is far more trivial than forging the source address of
an HTTP connection. In your scenario, I don't think it's an issue so long as
you only honor the last IP in the X-Forwarded-For's IP address list (the one
your trusted squid cache added). But commercial vendors don't necessarily have
your scenario on their radar. :|
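
  To make the "only honor the last IP" rule concrete, here is a small C
sketch. It assumes the header value is a comma-separated list to which each
proxy appends the address it received the request from, so the final entry
is the one added by your own squid. The function name xff_last is
hypothetical, and the sketch does not check that the entry is a well-formed
address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the last entry of a comma-separated X-Forwarded-For value into
 * out (of size outlen).  Returns 0 on success, or -1 if the last entry
 * is empty or does not fit in out.
 */
int
xff_last(const char *value, char *out, size_t outlen)
{
	const char *start, *end;
	size_t len;

	assert(value != NULL && out != NULL);

	/* The last entry starts after the final comma, if there is one. */
	start = strrchr(value, ',');
	start = (start != NULL) ? start + 1 : value;

	/* Trim surrounding spaces and tabs. */
	while (*start == ' ' || *start == '\t')
		start++;
	end = start + strlen(start);
	while (end > start && (end[-1] == ' ' || end[-1] == '\t'))
		end--;

	len = (size_t)(end - start);
	if (len == 0 || len >= outlen)
		return (-1);
	memcpy(out, start, len);
	out[len] = '\0';
	return (0);
}
```

  Anything a remote client put in the header ends up earlier in the list and
is simply ignored.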

  Kelly

--
Kelly Yancey  -  kbyanc@{posi.net,FreeBSD.org}





RE: 1 IP - 1 Firewall - 2 Webservers

2001-12-12 Thread Kelly Yancey

On Thu, 13 Dec 2001, Tom wrote:

> Well, running Apache I'm sure it can be done - but this depends on the 
> methods which have to be taken to have it up and running.  I have posted a 
> message on the e-smith forum about adding mods to an already installed 
> Apache server - but they are pretty slow at responding to things over there 
> - so someone here can probably shed some light for me :-)
> 
> Tom
> 

  Ah, well that is good news. I'm not familiar with e-smith, but from the
looks of their site it supports installing binary apache modules via RPM. They
have a pretty good list of contributed RPMs on their site (under "modules" on
the right of e-smith.org), but unfortunately it does not include
mod_extract_forwarded. If you know the glibc version, kernel version, and
whatever other linux variables matter, you may be able to just snag a binary
RPM for mod_extract_forwarded off the net (rpmfind.net, for example). This
will be made easier if the OS is a stock distribution (e.g. RedHat). Anyway,
I'm definitely getting out of my area of expertise, so I can't be of any more
help, but it sounds like you understand the issues and are well on the way to
getting things settled. Good luck,

  Kelly

--
Kelly Yancey  -  kbyanc@{posi.net,FreeBSD.org}





Re: crack pipe [was 1000baseSX driver support]

2002-02-05 Thread Kelly Yancey

On Tue, 5 Feb 2002, Matt Wilbur wrote:

> Thanks for all the help, my 3c985B was, uhm, having difficulty due to a,
> uhm, not completely seated riser card..
>
> Sorry for spewing before I checked the simple stuff.
>
> On that note, would anyone else have interest in a GA-621 driver?  I'll
> donate my GA-621 to the cause, it'd make life a lot easier for
> me.  Bill, I'll beg if that'll help . . .
>
> Thanks again,
> Matt
>

  Perhaps you could make mention of this at:
http://www.posi.net/freebsd/drivers/

  Kelly
  kbyanc@{posi.net,FreeBSD.org}





Re: Overflowing sockaddr_dl's sdl_data buffer

2002-04-30 Thread Kelly Yancey


  Attached are (very) simple patches which attempt to address the problem.
I've included -net in the CC to solicit a larger audience.

On Sun, 21 Apr 2002, Bill Fenner wrote:

>
> I think that sdl_data should go back to being a variable-length buffer,
> and the source routing stuff should be reimplemented somewhere else
> (perhaps at the end of the variable-length buffer).
>
> What uses the source-routing fields?
>
>   Bill
>

  Yeah, this is the route I favor, except that it would clearly break
compatibility with 3rd party binary-only drivers.  Personally, I would really
like to see a solution implemented in the RELENG_4 branch.  To that end,
the attached patches keep the sockaddr_dl at its current size but allot the
entire 34 bytes needed for token-ring source routing to the sdl_data field
(for a total of 46 bytes).  The token-ring code just embeds its source
routing information in the sdl_data field now.  I also removed setting the
source routing control field to zero for non-iso88025 sockaddr_dl's since all
of the code which examines the field appears to be contingent on the interface
being of the iso88025 persuasion.
  That said, this leaves ample room in the sockaddr_dl structure for the
interface name and MAC address (too much, but the overall size hasn't
changed).  However, token ring interface names are still limited to 6
characters before they risk overflowing the sdl_data field with their source
routing information.  This is no worse than the existing situation wherein a
token ring interface with a name of more than 6 characters would cause the
last byte(s) of the hardware address to get clobbered by the source routing
control field.
  One point I am a little leery of is that in in_arpinput() the original code
appears to have made provision for receiving an ISO88025 frame on a non-token
ring interface and trusted the source routing information contained in such a
frame.  First, is this a correct reading of the code?  And second, is this
correct behavior?  If so, I can easily restore it.
  There are 2 sets of attached patches: one for -current and one for -stable
(the one suffixed with a 4).  I've tested these pretty extensively on -stable
but haven't done any testing at all for -current (admittedly, not even a
build); furthermore all testing was just with ethernet...I do not have access
to any token ring hardware.  I would appreciate any feedback regarding the
approach, and confirmation from anyone who can verify that I haven't horribly
borked token ring source routing.
  If all looks well, then ifconfig (and others?) will have to be updated to
not try to print source routing information unless the interface is token
ring.

  Thanks,

  Kelly
  kbyanc@{posi.net,FreeBSD.org}

  The original message for those subscribed to -net but not -arch:

>From [EMAIL PROTECTED] Tue Apr 30 18:13:51 2002
Date: Sun, 21 Apr 2002 01:48:42 -0700 (PDT)
From: Kelly Yancey <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Overflowing sockaddr_dl's sdl_data buffer


  While working on a product at work, I discovered that it is trivial to
overflow the sdl_data buffer in sockaddr_dl structures.  In our case, I
encountered the bug by creating a vlan100 interface.  The sdl_data buffer is
populated with both the interface name and the parent interface's hardware
address; in this case 7 characters for the interface name and 6 more for the
parent's MAC address for a total of 13 characters (sdl_data is only defined
for 12 characters).  As a result, the sdl_rcf field is garbage (actually, the
last octet of the MAC address).  While I worked around the problem in our
product, I would prefer to see the bug fixed in FreeBSD proper.
  So, I would like to solicit discussion of the proper fix for this bug.
Should sdl_data's length be extended (say, to 16 characters)?  This would
surely break binary compatibility and only postpones the issue (imagine an
interface with a longer name).  Should bounds checking be added to eliminate
the (supposedly optional) interface name from the sdl_data buffer if there is
not room?  If so, how does one ensure all drivers (including 3rd party)
perform the bounds checking?  Surely there are other options too.  In any
event, the comment in sys/net/if_dl.h for the sdl_data field needs updating:
since the source routing information was added following the sdl_data
field, it is impossible for the sdl_data field to be larger than that defined
by the structure definition.
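
  For illustration only, the bounds check being discussed could look
something like the following C sketch. It is not taken from the attached
patches; the function name sdl_data_fits and the SDL_DATA_SIZE macro are
hypothetical, and the only figure used is the 12-byte sdl_data size quoted
above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SDL_DATA_SIZE	12	/* sdl_data size quoted in this message */

/*
 * Return 1 if an interface name followed by addr_len bytes of link-level
 * data fits in a buffer of data_size bytes, 0 if it would overflow.
 */
int
sdl_data_fits(const char *ifname, size_t addr_len, size_t data_size)
{
	assert(ifname != NULL);
	return (strlen(ifname) + addr_len <= data_size);
}
```

  With a 6-byte MAC address, "vlan100" (7 + 6 = 13 bytes) fails the check
against SDL_DATA_SIZE while "vlan10" (6 + 6 = 12 bytes) passes.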

  Thanks,

  Kelly
  kbyanc@{posi.net,FreeBSD.org}


Index: net/if_dl.h
===
RCS file: /home/cvs/acs/base/src/sys/net/if_dl.h,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 if_dl.h
--- net/if_dl.h 22 Mar 2002 04:11:00 -  1.1.1.1
+++ net/if_dl.h 30 Apr 2002 20:14:09 -
@@ -66,10 +66,8 @@
u_char  sdl_nlen;   /* interface name length, no trailing 

Request for review: patch to make netstat -rW behave as described in netstat(1)

2002-06-03 Thread Kelly Yancey


  I would appreciate it if someone could review the attached patch which makes
netstat calculate column widths for the routing table when the -W flag is
specified rather than just picking larger arbitrary values as it does now.
Other than making -W more useful, it syncs reality to the documentation; from
netstat(1):

     -W      In certain displays, avoid truncating addresses even if this
             causes some fields to overflow.

  Basically, all the patch does is add a preliminary pass to calculate the
necessary column widths when -W is specified.  It is pretty straightforward,
but nonetheless, I'd appreciate feedback before I commit it.  Thanks,

  Kelly
  kbyanc@{posi.net,FreeBSD.org}

 * Wow, netstat is so far from WARNS-clean, it's scary.
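
  The two-pass idea can be sketched in isolation as below. This is a generic
illustration, not code from the patch: column_width and print_column are
hypothetical helpers that work on an array of already-formatted strings,
whereas the patch walks the kernel routing tree.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * First pass: return the larger of min_width and the length of the
 * longest string in cells[0..n-1].
 */
int
column_width(const char *const *cells, int n, int min_width)
{
	int i, len, width = min_width;

	assert(cells != NULL || n == 0);
	for (i = 0; i < n; i++) {
		len = (int)strlen(cells[i]);
		if (len > width)
			width = len;
	}
	return (width);
}

/* Second pass: print each cell left-justified in the computed width. */
void
print_column(const char *const *cells, int n, int width)
{
	int i;

	for (i = 0; i < n; i++)
		printf("%-*s|\n", width, cells[i]);
}
```

  The default width acts as a floor, so output without long entries looks the
same as before; only a longer entry widens the column.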


Index: usr.bin/netstat/route.c
===
RCS file: /home/ncvs/src/usr.bin/netstat/route.c,v
retrieving revision 1.65
diff -u -u -r1.65 route.c
--- usr.bin/netstat/route.c 31 May 2002 04:36:55 -  1.65
+++ usr.bin/netstat/route.c 4 Jun 2002 03:47:57 -
@@ -126,12 +126,18 @@
 int NewTree = 0;
 
 static struct sockaddr *kgetsa (struct sockaddr *);
+static void size_cols (int ef, struct radix_node *rn);
+static void size_cols_tree (struct radix_node *rn);
+static void size_cols_rtentry (struct rtentry *rt);
 static void p_tree (struct radix_node *);
 static void p_rtnode (void);
 static void ntreestuff (void);
 static void np_rtentry (struct rt_msghdr *);
 static void p_sockaddr (struct sockaddr *, struct sockaddr *, int, int);
+static const char *fmt_sockaddr (struct sockaddr *sa, struct sockaddr *mask,
+int flags);
 static void p_flags (int, char *);
+static const char *fmt_flags(int f);
 static void p_rtentry (struct rtentry *);
 static u_long forgemask (u_long);
 static void domask (char *, u_long, u_long);
@@ -166,6 +172,7 @@
p_tree(head.rnh_treetop);
}
} else if (af == AF_UNSPEC || af == i) {
+   size_cols(i, head.rnh_treetop);
pr_family(i);
do_rtent = 1;
pr_rthdr(i);
@@ -224,17 +231,134 @@
 
 /* column widths; each followed by one space */
 #ifndef INET6
-#define WID_DST(af) 18  /* width of destination column */
-#define WID_GW(af)  18  /* width of gateway column */
-#define WID_IF(af)  6   /* width of netif column */
+#define WID_DST_DEFAULT(af) 18  /* width of destination column */
+#define WID_GW_DEFAULT(af)  18  /* width of gateway column */
+#define WID_IF_DEFAULT(af)  6   /* width of netif column */
 #else
-#define WID_DST(af) \
-   ((af) == AF_INET6 ? (Wflag ? 39 : (numeric_addr ? 33: 18)) : 18)
-#define WID_GW(af) \
-   ((af) == AF_INET6 ? (Wflag ? 31 : (numeric_addr ? 29 : 18)) : 18)
-#define WID_IF(af)  ((af) == AF_INET6 ? 8 : 6)
+#define WID_DST_DEFAULT(af) \
+   ((af) == AF_INET6 ? (numeric_addr ? 33: 18) : 18)
+#define WID_GW_DEFAULT(af) \
+   ((af) == AF_INET6 ? (numeric_addr ? 29 : 18) : 18)
+#define WID_IF_DEFAULT(af)  ((af) == AF_INET6 ? 8 : 6)
 #endif /*INET6*/
 
+static int wid_dst;
+static int wid_gw;
+static int wid_flags;
+static int wid_refs;
+static int wid_use;
+static int wid_mtu;
+static int wid_if;
+static int wid_expire;
+
+static void
+size_cols(int ef, struct radix_node *rn)
+{
+   wid_dst = WID_DST_DEFAULT(ef);
+   wid_gw = WID_GW_DEFAULT(ef);
+   wid_flags = 6;
+   wid_refs = 6;
+   wid_use = 8;
+   wid_mtu = 6;
+   wid_if = WID_IF_DEFAULT(ef);
+   wid_expire = 6;
+
+   if (Wflag)
+       size_cols_tree(rn);
+}
+
+static void
+size_cols_tree(struct radix_node *rn)
+{
+again:
+   kget(rn, rnode);
+   if (rnode.rn_bit < 0) {
+       if ((rnode.rn_flags & RNF_ROOT) == 0) {
+           kget(rn, rtentry);
+           size_cols_rtentry(&rtentry);
+       }
+       if ((rn = rnode.rn_dupedkey))
+           goto again;
+   } else {
+       rn = rnode.rn_right;
+       size_cols_tree(rnode.rn_left);
+       size_cols_tree(rn);
+   }
+}
+
+static void
+size_cols_rtentry(struct rtentry *rt)
+{
+   static struct ifnet ifnet, *lastif;
+   struct rtentry parent;
+   static char buffer[100];
+   const char *bp;
+   struct sockaddr *sa;
+   sa_u addr, mask;
+   int len;
+
+   /*
+    * Don't print protocol-cloned routes unless -a.
+    */
+   if (rt->rt_flags & RTF_WASCLONED && !aflag) {
+       kget(rt->rt_parent, parent);
+       if (parent.rt_fla

Re: host routes for interface addresses

2002-06-05 Thread Kelly Yancey

On Wed, 5 Jun 2002, Iasen Kostov wrote:

> It works fine (just a warrning) with 4.4 kernel and before, but in 4.5
> there is a check for host route addition and if it fail to add a route
> it also fails to set iface address (ofcourse I've patch it for myself).
>   I need this not just for saving IPs but it's somewhat easier to route
> just throw iface i not to care about iface IPs. And something more, You
> know that the router has 1 IP and thats it , don't care on which iface
> Your are connected right now, it has 1 IP and thats your gateway.
>   This scheme looks a bit like Cisco's "ip unnumbered" interfaces and I
> don't think it's a bad idea.
>

  You might want to take a look at Marko Zec's VIPA patches that he posted to
-net a few days ago.  You should be able to find it in the mailing list
archives under the subject "Patch for review: source VIPA".
  If you have the time to review/test his patches, then perhaps we can get
it into a future release of FreeBSD (solving your problem with a viable
long-term solution).  Thanks,

  Kelly
  kbyanc@{posi.net,FreeBSD.org}





Re: IP_MULTICAST_LOOP

2002-06-06 Thread Kelly Yancey

On Thu, 6 Jun 2002, Vadim Egorov wrote:
>
> Hi guys!
>
> I'm playing with multicasting (-stable), and I want to disable looping back
> my outgoing packets setting IP_MULTICAST_LOOP option to 0 but it doen't
> have any effect. My app is listening to the same group it is casting.
>
> After some grepping I came across some code in netinet/ip_output.c:
>  (imo == NULL || imo->imo_multicast_loop)) {
>   /*
>* If we belong to the destination multicast group
>* on the outgoing interface, and the caller did not
>* forbid loopback, loop back a copy.
>*/
>
> The comment says 'and' but the code says '||' -- looks like an error to me.
> Except this I've got no idea what it means - does it make amy sence?
>

  You definitely wouldn't want this to be && because if imo is NULL you
certainly wouldn't want to dereference it. :)  The comment's logic matches the
code, it is just that the phrasing is inverted.
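
  A stripped-down C sketch of the same test may make the point clearer. The
struct here is a hypothetical stand-in for the kernel's multicast options,
with a single loop flag in place of imo_multicast_loop; it is not the kernel
code itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's multicast options structure. */
struct mcast_opts {
	unsigned char loop;	/* nonzero: loop back a copy */
};

/*
 * Loop back unless the caller supplied options that forbid it.
 * A NULL pointer means no options were set, so the default (loop) applies.
 * The || short-circuits, so opts is only dereferenced when it is non-NULL.
 */
int
should_loop(const struct mcast_opts *opts)
{
	return (opts == NULL || opts->loop != 0);
}
```

  Read this way, "the caller did not forbid loopback" is exactly "there are
no options, or the options say loop". Writing it with && would dereference a
NULL pointer whenever no options had been set.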

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
FreeBSD, The Power To Serve: http://www.freebsd.org/





Re: tcpdump and ipsec

2006-04-11 Thread Kelly Yancey
On Sun, 2 Apr 2006, Dmitry Pryanishnikov wrote:

>
> Hello!
>
> On Sun, 2 Apr 2006, Bjoern A. Zeeb wrote:
> >> Why not? IMHO it will be very useful feature: think about e.g. traffic
> >> shaping for several different networks which are routed via the same
> >> ipsec tunnel. Without the enc0, you can only shape them together, e.g.:
> >
> > why not shaping on the internal interface in case this is a gateway?
> > You know src and dst there too.
>
>   Gateway can also contain sources of traffic, and we should be able
> to shape all outgoing or incoming traffic (not only transit packets,
> but also locally-originated).
>
> > The only difference enc0 makes is for host-only-setups or if you want
> > to see all your unencrpyted ipsec traffic on a gateway in one place.
>
>   It seems to me that it's also useful for general traffic
> shaping/accounting/filtering purposes.
>
> Sincerely, Dmitry

  I agree 100%.  At work, we implemented the enc interface for FreeBSD
4.7 and 4.10 along with extending the divert interface such that we
could perform filtering and NAT on packets after tunnel decapsulation.
Just because one person doesn't have a use for the enc interface, does
not mean that no one does.

  Kelly

-- 
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]


Re: tcpdump and ipsec

2006-04-13 Thread Kelly Yancey
On Tue, 11 Apr 2006, Bjoern A. Zeeb wrote:

> On Tue, 11 Apr 2006, Kelly Yancey wrote:
>
> Hi,
>
> > On Sun, 2 Apr 2006, Dmitry Pryanishnikov wrote:
> >
> >> On Sun, 2 Apr 2006, Bjoern A. Zeeb wrote:
> >>>> Why not? IMHO it will be very useful feature: think about e.g. traffic
> >>>> shaping for several different networks which are routed via the same
> >>>> ipsec tunnel. Without the enc0, you can only shape them together, e.g.:
> >>>
> >>> why not shaping on the internal interface in case this is a gateway?
> >>> You know src and dst there too.
> >>
> >>   Gateway can also contain sources of traffic, and we should be able
> >> to shape all outgoing or incoming traffic (not only transit packets,
> >> but also locally-originated).
> >>
> >>> The only difference enc0 makes is for host-only-setups or if you want
> >>> to see all your unencrpyted ipsec traffic on a gateway in one place.
> >>
> >>   It seems to me that it's also useful for general traffic
> >> shaping/accounting/filtering purposes.
> >>
> >  I agree 100%.  At work, we implemented the enc interface for FreeBSD
> > 4.7 and 4.10 along with extending the divert interface such that we
> > could perform filtering and NAT on packets after tunnel decapsulation.
>
> you know you can do this with what's in there already w/o enc(4)?
> At least I have been doing it for more than two years now with 5.x
> and greater.  Actually this mail will get to you via such a setup.
>

  Really?  We aren't likely to move our product to 5.x or 6.x, but
I'm curious: how are you performing NAT on your tunnelled traffic?
  If we were just talking about filtering, I would assume you were
referring to the "ipsec" rule (which was introduced circa 4.9, hence not
available when we implemented the enc interface on 4.7).  However, I
cannot figure out for the life of me how one would perform NAT on
packets *inside* the IPsec tunnel without the enc interface.  For
example, the only pfil hook in the packet output path is in ip_output
*after* IPsec encapsulation has occurred.  Perhaps I'm missing
something.

>
> > Just because one person doesn't have a use for the enc interface, does
> > not mean that no one does.
>
> agreed.
>
> good arguments for example would also be that filtering IPSec traffic
> with pf would becomen possible easily as long as there is no such
> thing like the ipsec flag in ipfw...
>

  I'm really looking forward to hearing how you are diverting traffic to
natd before IPsec encapsulation.  Thanks,

  Kelly

-- 
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]


Re: tcpdump and ipsec

2006-04-17 Thread Kelly Yancey
On Mon, 17 Apr 2006, Bjoern A. Zeeb wrote:

> On Thu, 13 Apr 2006, Kelly Yancey wrote:
>
> > I'm curious: how are you performing NAT on your tunnelled traffic?
>
> the answer is simple: do not NAT on the ipsec interface though it's
> not fully correct because I do even NAT traffic that goes like:
>
> A  lan1(ipsec only) --- gw(NAT) --- lan2(ipsec only)  B
>
> [ipsec only == esp and ike allowed]
>
> so the better explanation perhaps is:
> do not nat on the ipsec interface of the outgoing direction.
>

  "When all you have is a hammer, everything looks like a nail" :)

  In our case, we couldn't use that hack because we have multiple
interfaces, each with its own NAT config.  We have to run natd on the
interface that the traffic is traversing.  With the enc interface, we
can handle packets inside the tunnel separate from the tunnel traffic
itself without resorting to gymnastics.
  If I had time I'd integrate PR 94829 myself, but it looks like I'm
going to have my hands full for a couple of months. :|  We'll see if
anyone else picks it up in the meantime...

  Kelly

-- 
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]


Re: freeBSD /ipfw/ divert socket

2006-04-24 Thread Kelly Yancey
On Fri, 21 Apr 2006, Amit Mondal wrote:

> Hi All,
>
> I need a little help with FreeBSD Kernel stuff. I wanna use Divert Socket to
> sniff IP packet in FreeBSD.
> For that I have compiled the kernel with options IPDIVERT and everything is
> ok.
>
> Now, when I am not really sniffing and re-injecting the packet back to the
> network stack, it is basically dropping all the packets. But I want it
> pass-through it, when no application is reading at divert socket. My
> question is, HOW CAN I MAKE IT PASS-THROUGH? IF NO APPLICATION IS READING
> FROM DIVERT SOCKET, IT SHOULD WORK AS IF THERE IS NO DIVERT SOCKET.
>
> Thanks in adavnce
>
> Rgds
> Amit
>

  Attached is a really old patch I made against FreeBSD 4.7.  It might
apply to 4.9.  Even if it doesn't, it should give you a pretty good idea
how to implement the functionality you desire.

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
FreeBSD, The Power To Serve: http://www.freebsd.org/

Index: ip_fw2.c
===
RCS file: /home/cvs/acs/base/src/sys/netinet/ip_fw2.c,v
retrieving revision 1.9
retrieving revision 1.11
diff -u -p -r1.9 -r1.11
--- ip_fw2.c 3 Jan 2003 23:34:19 -   1.9
+++ ip_fw2.c 8 Jan 2003 06:14:48 -   1.11
@@ -580,17 +580,17 @@ ipfw_log(struct ip_fw *f, u_int hlen, st
}
if (oif || m->m_pkthdr.rcvif)
log(LOG_SECURITY | LOG_INFO,
-   "ipfw: %d %s %s %s via %s%d%s\n",
+   "ipfw: %d %s %s %s via %s%d%s (layer %d)\n",
f ? f->rulenum : -1,
action, proto, oif ? "out" : "in",
oif ? oif->if_name : m->m_pkthdr.rcvif->if_name,
oif ? oif->if_unit : m->m_pkthdr.rcvif->if_unit,
-   fragment);
+   fragment, eh ? 2 : 3);
else
log(LOG_SECURITY | LOG_INFO,
-   "ipfw: %d %s %s [no if info]%s\n",
+   "ipfw: %d %s %s [no if info]%s (layer %d)\n",
f ? f->rulenum : -1,
-   action, proto, fragment);
+   action, proto, fragment, eh ? 2 : 3);
if (limit_reached)
log(LOG_SECURITY | LOG_NOTICE,
"ipfw: limit %d reached on entry %d\n",
@@ -1939,8 +1939,10 @@ check_body:
goto done;
 
case O_FORWARD_IP:
-   if (args->eh)   /* not valid on layer2 pkts */
-   break;
+   if (args->eh && oif != NULL) {
+   /* ignore outbound layer2 pkts */
+   goto next_rule;
+   }
if (!q || dyn_dir == MATCH_FORWARD)
args->next_hop =
&((ipfw_insn_sa *)cmd)->sa;
Index: ip_input.c
===
RCS file: /home/cvs/acs/base/src/sys/netinet/ip_input.c,v
retrieving revision 1.14
retrieving revision 1.16
diff -u -p -r1.14 -r1.16
--- ip_input.c  3 Jan 2003 04:46:53 -   1.14
+++ ip_input.c  8 Jan 2003 06:16:06 -   1.16
@@ -369,8 +369,18 @@ ip_input(struct mbuf *m)
case PACKET_TAG_IPFORWARD:
args.next_hop = (struct sockaddr_in *)m->m_hdr.mh_data;
break;
+   case PACKET_TAG_IPFORWARD | M_PROTO5: {
+   /* XXX This should be taken out and shot! */
+   struct mbuf *tag = m;
+   m = m->m_next;
+   args.next_hop = (struct sockaddr_in *)tag->m_hdr.mh_data;
+   m_free(tag);
+   KASSERT(m->m_type != MT_TAG, ("XXX kill me"));
+   goto posttags;
+   }
}
}
+posttags:
 
KASSERT(m != NULL && (m->m_flags & M_PKTHDR) != 0,
("ip_input: no HDR"));
Index: if_ethersubr.c
===
RCS file: /home/cvs/acs/base/src/sys/net/if_ethersubr.c,v
retrieving revision 1.9
retrieving revision 1.11
diff -u -p -r1.9 -r1.11
--- if_ethersubr.c  3 Jan 2003 04:40:06 -   1.9
+++ if_ethersubr.c  8 Jan 2003 06:16:05 -   1.11
@@ -501,7 +501,7 @@ ether_ipfw_chk(struct mbuf **m0, struct 
args.oif = flags & ETHER_IPFW_OUTPUT ? ifp : NULL;
args.divert_rule = divert_rule;
args.rule = *rule;  /* matching rule to restart */
-   args.next_hop = NULL;   /* we do not sup

Re: freeBSD /ipfw/ divert socket

2006-04-24 Thread Kelly Yancey
On Mon, 24 Apr 2006, Kelly Yancey wrote:

> On Fri, 21 Apr 2006, Amit Mondal wrote:
>
> > Hi All,
> >
> > I need a little help with FreeBSD Kernel stuff. I wanna use Divert Socket to
> > sniff IP packet in FreeBSD.
> > For that I have compiled the kernel with options IPDIVERT and everything is
> > ok.
> >
> > Now, when I am not really sniffing and re-injecting the packet back to the
> > network stack, it is basically dropping all the packets. But I want it
> > pass-through it, when no application is reading at divert socket. My
> > question is, HOW CAN I MAKE IT PASS-THROUGH? IF NO APPLICATION IS READING
> > FROM DIVERT SOCKET, IT SHOULD WORK AS IF THERE IS NO DIVERT SOCKET.
> >
> > Thanks in advance
> >
> > Rgds
> > Amit
> >
>
>   Attached is a really old patch I made against FreeBSD 4.7.  It might
> apply to 4.9.  Even if it doesn't, it should give you a pretty good idea
> how to implement the functionality you desire.
>
>   Kelly
>

  Sorry, wrong patch.  The correct patch is attached.

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
FreeBSD, The Power To Serve: http://www.freebsd.org/

Index: sys/netinet/ip_divert.c
===
RCS file: /home/cvs/acs/base/src/sys/netinet/ip_divert.c,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -p -r1.3 -r1.4
--- ip_divert.c 10 Oct 2002 20:42:00 -  1.3
+++ ip_divert.c 23 Nov 2002 05:34:10 -  1.4
@@ -109,6 +109,23 @@ static u_long  div_recvspace = DIVRCVQ;/
 /* Optimization: have this preinitialized */
 static struct sockaddr_in divsrc = { sizeof(divsrc), AF_INET };
 
+
+static int div_output(struct socket *so, struct mbuf *m,
+  struct sockaddr_in *sin, struct mbuf *control);
+static int div_attach(struct socket *so, int proto, struct proc *p);
+static int div_detach(struct socket *so);
+static int div_abort(struct socket *so);
+static int div_disconnect(struct socket *so);
+static int div_bind(struct socket *so, struct sockaddr *nam,
+struct proc *p);
+static int div_shutdown(struct socket *so);
+static int div_send(struct socket *so, int flags, struct mbuf *m,
+struct sockaddr *nam, struct mbuf *control,
+struct proc *p);
+static int div_pcblist(SYSCTL_HANDLER_ARGS);
+
+
+
 /*
  * Initialize divert connection block queue.
  */
@@ -146,8 +163,9 @@ div_input(struct mbuf *m, int off, int p
  * then pass them along with mbuf chain.
  */
 void
-divert_packet(struct mbuf *m, int incoming, int port, int rule)
+divert_packet(struct mbuf *m, int flags, int port, int rule)
 {
+   static struct socket *divnullso;
struct ip *ip;
struct inpcb *inp;
struct socket *sa;
@@ -169,7 +187,7 @@ divert_packet(struct mbuf *m, int incomi
 * But only for incoming packets.
 */
divsrc.sin_addr.s_addr = 0;
-   if (incoming) {
+   if (flags & IP_DIVERT_INCOMING) {
struct ifaddr *ifa;
 
/* Sanity check */
@@ -227,6 +245,22 @@ divert_packet(struct mbuf *m, int incomi
m_freem(m);
else
sorwakeup(sa);
+   } else if (flags & IP_DIVERT_DONTDROP) {
+   /* Pretend the packet was passed back unchanged. */
+   ipstat.ips_delivered--;
+   if (divnullso == NULL) {
+   /*
+* Allocate a dummy socket for ip_output() when
+* looping back diverted packets.
+*/
+   if (socreate(PF_INET, &divnullso, SOCK_RAW,
+   IPPROTO_DIVERT, &proc0) != 0) {
+   m_freem(m);
+   ipstat.ips_odropped++;
+   return;
+   }
+   }
+   div_output(divnullso, m, &divsrc, NULL);
} else {
m_freem(m);
ipstat.ips_noproto++;
@@ -245,8 +279,8 @@ static int
 div_output(struct socket *so, struct mbuf *m,
struct sockaddr_in *sin, struct mbuf *control)
 {
-   int error = 0;
struct m_hdr divert_tag;
+   int error = 0;
 
/*
 * Prepare the tag for divert info. Note that a packet
Index: sys/netinet/ip_fw.h
===
RCS file: /home/cvs/acs/base/src/sys/netinet/ip_fw.h,v
retrieving revision 1.4
retrieving revision 1.5
diff -u -p -r1.4 -r1.5
--- ip_fw.h 15 Nov 2002 00:11:42 -  1.4
+++ ip_fw.h 23 Nov 2002 05:34:10 -  1.5
@@ -330,6 +330,7 @@ struct ipfw_dyn_rule {
  */
 #ifdef _KERNEL
 
+#define IP_FW_PORT_MASK 0x

Re: ipsec with ipfw divert (not NAT) encodes a packet twice breaking PMTUD

2006-09-11 Thread Kelly Yancey
On Mon, 11 Sep 2006, Eugene Grosbein wrote:

>
> >Submitter-Id:  current-users
> >Originator:    Eugene Grosbein
> >Organization:  Svyaz Service JSC
> >Confidential:  no
> >Synopsis:      ipsec with ipfw divert (not NAT) encodes a packet twice breaking PMTUD
> >Severity:      serious
> >Priority:      high
> >Category:      kern
> >Class:         sw-bug
> >Release:       FreeBSD 6.1-STABLE i386
> >Environment:
> System: FreeBSD nkz.delikates-nk.ru 6.1-STABLE FreeBSD 6.1-STABLE #1: Thu Sep 7 13:31:53 KRAST 2006 [EMAIL PROTECTED]:/home/obj/home/src/sys/NKZ i386
>   options IPDIVERT
>   options IPSEC
>   options IPSEC_ESP
>
> >Description:
>   When outgoing packet encoded due to corresponding IPSEC policy
>   is passed to divert socket (f.e. to ipacctd for accounting),
>   it is encoded second time with IPSEC then. Besides obvious
>   logic error, this also results in broken Path MTU Discovery.
>
> >How-To-Repeat:
>
>   Use a kernel with options IPDIVERT, IPSEC, IPSEC_ESP
>   (my kernel also contains IPSEC_FILTERGIF, but this should not matter).
>
>   Suppose there are two local nets numbered 192.168.1.0/24
>   and 192.168.2.0/24, each has a FreeBSD router
>   (192.168.1.1 and 192.168.2.1). Routers make gif(4) tunnel between
>   and use IPSEC transport mode to encrypt its contents.
>   Their external IP addresses are 1.1.1.1 and 2.2.2.2
>

  Just FYI, when we implemented the enc interface for FreeBSD 4.10 for
one of our products at work, we encountered a similar issue.  The
problem is that you need to add a flag to the sockaddr_in passed to the
divert(4) consumer; when that consumer re-injects the packets into the
network stack, ip_output() needs to check for the flag and goto
skip_ipsec to avoid re-encapsulation.  The next issue is that
there is no room in the sockaddr_in structure for such a flag.
  We resorted to a hack (eventually, we re-implemented the divert
interface to have its own sockaddr_div rather than overloading
sockaddr_in, but that is another story).  We stuck the flag indicating
whether to skip IPsec encapsulation on input in the high bit of the
first byte in the sin_zero array.  This only works because natd(8)
doesn't inspect the (partial) interface name stored in the
sockaddr_in's sin_zero array.  "Hack" doesn't really begin to describe
the hideousness of this workaround, but it will get the job done.
  It looks like the same effect might be achievable by modifying natd
to set the policy on the divert socket to IPSEC_POLICY_NONE via the
IP_IPSEC_POLICY socket option, but I don't know enough about that code
path to say for sure.
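
  To make the sin_zero hack concrete, here is a minimal userland sketch.
The macro and function names are hypothetical (they are not the names we
used, and nothing like them exists in the stock tree); the only thing
taken from the description above is the idea of using the high bit of
the first sin_zero byte.  It relies on interface names being ASCII, so
that bit is otherwise always clear.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>

/* Hypothetical flag: high bit of the first sin_zero byte. */
#define DIV_SKIP_IPSEC 0x80

/* Mark a divert address so that re-injection can bypass IPsec. */
static void
div_set_skip_ipsec(struct sockaddr_in *sin)
{
	assert(sin != NULL);
	sin->sin_zero[0] =
	    (char)((unsigned char)sin->sin_zero[0] | DIV_SKIP_IPSEC);
}

/* Nonzero if the (hypothetical) skip-IPsec flag is set. */
static int
div_skip_ipsec(const struct sockaddr_in *sin)
{
	assert(sin != NULL);
	return ((unsigned char)sin->sin_zero[0] & DIV_SKIP_IPSEC) != 0;
}

/* First byte of the stored interface name with the flag masked off. */
static char
div_first_name_byte(const struct sockaddr_in *sin)
{
	assert(sin != NULL);
	return (char)((unsigned char)sin->sin_zero[0] & ~DIV_SKIP_IPSEC);
}
```

  In the kernel, the check corresponding to div_skip_ipsec() would sit
in ip_output() ahead of the IPsec processing, as described above.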

  Good luck,

  Kelly

-- 
Kelly Yancey  -  [EMAIL PROTECTED]  -  [EMAIL PROTECTED]



Re: slow writes on nfs with bge devices

2007-01-24 Thread Kelly Yancey
On Sun, 21 Jan 2007, Max Laier wrote:

> On Sunday 21 January 2007 13:25, Bruce Evans wrote:
> > On Sun, 21 Jan 2007, Max Laier wrote:
> > > On Sunday 21 January 2007 07:25, Bruce Evans wrote:
> > >> nfs writes much less well with bge NICs than with other NICs (sk,
> > >> fxp,
> > >
> > > Do you use hardware checksumming on the bge?  There is an XXX in
> > > bge_start_locked() that looks a bit suspicious to me.
> >
> > I use the default for that.  Wouldn't checksum problems show up as
> > errors somewhere?
>
> Did you look at the code in question?  It is concerned with fragmented
> packet chains (which NFS over UDP usually generates) and only commits to
> sending them if there are enough descriptors available at once.  This
> can easily explain burstiness.
>
> Can you just try to disable the delayed checksums via "ifconfig -txcsum"?
> Should be an easy enough test.
>

  I realize that Bruce has already identified the problem as being with
the cabling; however, I wanted to add a warning that disabling hardware
checksums for bge cards is not a good idea.  You can find my analysis of
data corruption bugs caused by using bge cards without checksum
offloading in the archives:
http://lists.freebsd.org/pipermail/freebsd-net/2004-January/002530.html

  Kelly

-- 
Kelly Yancey  -  [EMAIL PROTECTED] | [EMAIL PROTECTED]


Re: About "sockaddr_in" and "sockaddr_in6" structures

2002-07-09 Thread Kelly Yancey

On Tue, 9 Jul 2002, Juan Francisco Rodriguez Hervella wrote:

> Hello:
> 
> I'm seeing that "struct sockaddr_in" has a field like this:
> 
> char sin_zero[8];
> 
> Why ?
> Could anyone explain me what's used for ?
> 
> Could it be bad if I'd add the same field to
> "sockaddr_in6" ?
> 
> PS: I'm trying to implement divert sockets for
> IPv6 using the KAME implementation of "ip6fw".
> 
> Thanks.
> 
> JFRH.
> 

  The minimum size of a sockaddr's address portion (as defined in
sys/socket.h) is 14 bytes (max is SOCK_MAXADDRLEN = 255); combined with
the size of the sa_len and sa_family fields you get a minimum length of 16
bytes for the entire structure.  The sin_zero field of sockaddr_in is to
pad the structure out to this minimum length.  Given that a sockaddr_in6
is already larger than the minimum required, there really isn't any point
in adding the padding field to it.
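
  The size relationship can be checked with a short sketch like the one
below.  The helper names are made up for illustration; the only
assumption is the usual layout, where struct sockaddr is 16 bytes and
sin_zero fills whatever sockaddr_in does not otherwise use.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Bytes of sockaddr_in in use before the sin_zero padding begins. */
static size_t
sin_used_bytes(void)
{
	return offsetof(struct sockaddr_in, sin_zero);
}

/* Padding needed to grow `used` bytes to the generic sockaddr size. */
static size_t
pad_to_sockaddr(size_t used)
{
	size_t min = sizeof(struct sockaddr);

	return used < min ? min - used : 0;
}

/* Sanity check of the layout claims on the build platform. */
static void
check_sockaddr_layout(void)
{
	assert(sizeof(struct sockaddr_in) == sizeof(struct sockaddr));
	assert(sizeof(struct sockaddr_in6) > sizeof(struct sockaddr));
}
```

  pad_to_sockaddr(sizeof(struct sockaddr_in6)) comes out as zero, which
is the reason a sin_zero-style field buys nothing there.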

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-net" in the body of the message



Re: sysctl inferface question

2002-07-11 Thread Kelly Yancey

On Thu, 11 Jul 2002, Juan Francisco Rodriguez Hervella wrote:

> Hello:
> 
> I'm very confused with the sysctl internals.
> 
> For example, looking at the kernel source code of FreeBSD, I've realized
> 
> of the following:
> 
> netinet/in_proto.c:
> SYSCTL_NODE(_net_inet6, IPPROTO_DIVERT, divert,
> CTLFLAG_RW, 0,  "DIVERT");
> netinet/ip_divert.c:
> SYSCTL_DECL(_net_inet_divert);
> netinet/ip_divert.c:
> SYSCTL_PROC(_net_inet_divert, OID_AUTO, pcblist, CTLFLAG_RD,
> 0, 0,
> div_pcblist, "S,xinpcb", "List of active divert sockets");
> 
> Isn't this redundant ? I mean, if there is a "SYSCTL_NODE", there is
> *no* need for having
> "SYSCTL_DECL" in "ip_divert.c"... I am wrong ?
> 

  It is a scoping/linking issue: the SYSCTL_DECL is needed in
netinet/ip_divert.c so that children may be added to the node which was
defined in netinet/in_proto.c.  Without it, the very next line in
netinet/ip_divert.c would fail to compile because it could not find the
parent node.  A good C reference would probably better explain the
difference between declaring a variable and defining a variable, but that
is exactly the difference you are witnessing here.
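
  As a plain-C analogy (not the actual sysctl macros; the variable and
function names here are invented), the declaration/definition split
looks roughly like this.  Both halves are shown in one file only to keep
the sketch self-contained; normally the definition lives in a different
source file.

```c
#include <assert.h>

/* Declaration: promises that the object exists somewhere. */
extern int divert_node_children;

/* Code placed after the declaration may reference the object. */
static int
add_child(void)
{
	assert(divert_node_children >= 0);
	return ++divert_node_children;
}

/* Definition: allocates the storage; normally in another file. */
int divert_node_children = 0;
```

  Remove the extern line and add_child() no longer compiles, which is
the same failure SYSCTL_PROC would hit without the SYSCTL_DECL.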

> Also, I don't understand the meaning of the "fmt" field.  What is it for?
> What's the meaning of "S,xinpcb" in the above example?
> 
> Thanks.
> 
> --
> JFRH.
> 

  The fmt field is used by sysctl(8) to format the data returned from the
kernel.  The "S,xinpcb" format string tells sysctl(8) to use its
definition of "xinpcb" formatting to render the structure.  Take a look at
/usr/src/sbin/sysctl/sysctl.c; it is a pretty light read.
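
  A rough sketch of the idea, assuming a format string of the form
"S,<name>" marks a structure: the function below is hypothetical and is
not lifted from sysctl.c, it only shows how a consumer might pull the
structure name out of the string before choosing a formatter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the structure name from a format string such as "S,xinpcb",
 * or NULL when the format does not describe a structure.
 */
static const char *
fmt_struct_name(const char *fmt)
{
	assert(fmt != NULL);
	if (fmt[0] != 'S' || fmt[1] != ',')
		return NULL;
	return fmt + 2;
}
```

  A caller would then compare the returned name against the structures
it knows how to print and fall back to a raw dump for the rest.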

  Good luck,

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}






Re: mbuf external buffer reference counters

2002-07-11 Thread Kelly Yancey

On Thu, 11 Jul 2002, Bosko Milekic wrote:

> 
> On Thu, Jul 11, 2002 at 01:56:08PM -0700, Luigi Rizzo wrote:
> > example: userland does an 8KB write, in the old case this requires
> > 4 clusters, with the new one you end up using 4 clusters and stuff
> > the remaining 16 bytes in a regular mbuf, then depending on the
> > relative producer-consumer speed the next write will try to fill
> > the mbuf and attach a new cluster, and so on... and when TCP hits
> > these data-in-mbuf blocks will have to copy rather than reference
> > the data blocks...
> > 
> > Maybe it is irrelevant for performance, maybe it is not,
> > i am not sure.
> 
>   I see what you're saying.  I think that what this means is simply that
>   the `optimal' chunk of data to send is just a different size, so
>   instead of it being 8192 bytes, it'll be something like 8180 bytes or
>   something (to account for the counters).  So, in other words, it
>   really depends on the frequency of exact 8192 sized sends in userland
>   applications.
> 

  ...or exactly 2k or 4k or 6k or 10k...

>   This is a good observation if we're going to be doing benchmarking,
>   but I'm not sure whether the repercussions are that important (unless,
>   as I said, there's a lot of applications that send exactly 8192
>   byte chunks?).  Basically, what we're doing is shifting the optimal
>   send size when using exactly 4 clusters, in this case, to (8192 - 16)
>   bytes.  We can still send with exactly 4 clusters, it's just that the
>   optimal send size is a little different, that's all (this produces a
>   small shift in block send benchmark curves, usually).
> 

  Are you kidding?  Benchmarks, presumably like every other piece of
software produced by someone trying to get the most performance out of
the system, are more likely to have power-of-two write buffers.  Are you
willing to risk that they didn't also just happen to pick a multiple of
2^11?

  Yes, it seems elegant to put the counters in the space that is normally
unused for receive mbuf clusters, but you can't just blow off Luigi's
point regarding the send side.
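
  To put numbers on it, here is a small sketch of the arithmetic.  The
2048-byte cluster size follows from the "4 clusters for 8KB" example;
the 4-byte counter per cluster is an assumption inferred from Luigi's
"remaining 16 bytes", and the names are made up.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_SIZE 2048u	/* 2^11, per the 8KB = 4 clusters example */
#define REFCNT_SIZE     4u	/* assumed: 16 bytes lost over 4 clusters */

struct split {
	size_t clusters;	/* completely filled clusters */
	size_t leftover;	/* bytes that spill into another mbuf */
};

/* Split a write of len bytes when each cluster reserves some bytes. */
static struct split
split_write(size_t len, size_t reserved)
{
	struct split s;
	size_t usable;

	assert(reserved < CLUSTER_SIZE);
	usable = CLUSTER_SIZE - reserved;
	s.clusters = len / usable;
	s.leftover = len % usable;
	return s;
}
```

  split_write(8192, 0) fills 4 clusters exactly, while
split_write(8192, REFCNT_SIZE) fills 4 clusters and leaves 16 bytes
over.  Only a write of 8176 bytes fits cleanly in the second case, and
nobody picks 8176 as a buffer size.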

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}





Re: mbuf external buffer reference counters

2002-07-11 Thread Kelly Yancey

On Thu, 11 Jul 2002, Bosko Milekic wrote:

>  First of all, I'm not "blowing off" anyone's comments.  I don't
>  appreciate the fact that you're eagerly instructing me to "not blow off
>  comments" (which I didn't do to begin with) without providing any more
>  constructive feedback.
> 
>  All I pointed out was that the optimal block size is merely changed
>  from an exact 2k, 4k, 8k, etc. to something slightly smaller.  What
>  point are *you* trying to put across?  Tell me what's bad about that
>  or, better: 
>  
>  Do you have a better suggestion to make?  What do *you* suggest we do
>  with the external ref. counts?  Please, spare me the flame bait.  I
>  wasn't being confrontational when I answered Luigi's post and I don't
>  need anyone turning this into something confrontational.  Thanks.
> 
> --
> Bosko Milekic
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
> 

  Whoa man, that must have come across completely wrong.  I didn't mean to
be confrontational at all.  Actually, if anything I was just
trying to restate what should be obvious (and which I think was the point
Luigi already made): that for better or worse userland apps think that
using power-of-2 write buffers will improve performance.
  You're right, I don't understand all of the issues well enough to
suggest an alternative.  And if it weren't for the fact that just about
every engineer on the planet has had the "power-of-2 good" rule drilled
into them, I would have kept my mouth shut as I usually do.  When I saw
you suggesting that the optimum size would just be a little lower without
mentioning POLA, an alarm went off in my head.

  In any event, I'll go crawl back into my corner now.

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}






Re: Inconsistency between net/if.c and several ethernet drivers

2002-07-16 Thread Kelly Yancey

On Tue, 16 Jul 2002, Bill Baumann wrote:

>
> In net/if.c in a couple of places, the ethernet address is needed.  This
> is stored in the arpcom structure.  A couple lines of code in if.c require
> struct arpcom be at the very begining of device softc structures.  Nearly
> all drivers observe this.  However, several do not.  Sadly, this includes
> the one I'm working on.
>
> net/if.c routines if_findindex() and if_setlladdr() gain access to the
> ethernet address via the following expression:
>
>   ((struct arpcom *)ifp->if_softc)->ac_enaddr
>
> The above code assumes that the if_softc pointer is equivalent to an
> struct arpcom pointer.  The awi, ray, lnc and pdq drivers have other
> fields at the beginning of their softc structures.  Attempts to set the
> ethernet address of these devices may cause corruption.
>
>
> Shouldn't access of arpcom be via ifp instead?
>
>   ((struct arpcom *)ifp)->ac_enaddr
>
>
> For example, if_ethersubr.c uses the following macro:
> #define IFP2AC(IFP) ((struct arpcom *)IFP)
>
> It looked to me like the other code in net, like if_ethersubr.c use ifp
> rather than if_softc to find struct arpcom.
>
> Bug?
>

  Design. :)  See page 77 of Stevens' TCP/IP Illustrated Volume 2.  By putting
the structures at the beginning of the softc, the networking code can access
them without any explicit knowledge of the driver's softc itself (i.e. it can
use the softc as an opaque encapsulated version of either the arpcom or ifnet
structures).  The bug, then, would seem to be in the network drivers that
don't follow this convention.  But I'm not familiar with those particular
drivers, so I cannot comment on them; perhaps they employ some cleverness to
circumvent the requirement (but why?).  Anyway, it should be obvious that
accessing the arpcom structure by casting from either the ifnet structure or
the softc structure is supposed to yield the same result, so the code you
quoted above is fine.
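
  A stripped-down sketch of the convention, using invented stand-in
structures rather than the real ifnet/arpcom definitions: when the
shared structure is the first member, a pointer to the softc is also a
valid pointer to it, and generic code can cast without knowing anything
about the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct x_ifnet  { int if_unit; };
struct x_arpcom { struct x_ifnet ac_if; unsigned char ac_enaddr[6]; };

/* Conventional driver: arpcom is the first member of the softc. */
struct good_softc { struct x_arpcom arpcom; int hw_state; };

/* Unconventional driver: another field precedes arpcom. */
struct bad_softc { int hw_state; struct x_arpcom arpcom; };

/* Generic code that knows only the convention, not the driver. */
static unsigned char *
softc_enaddr(void *softc)
{
	assert(softc != NULL);
	return ((struct x_arpcom *)softc)->ac_enaddr;
}
```

  For good_softc the cast lands on the real ac_enaddr.  For bad_softc,
offsetof(struct bad_softc, arpcom) is nonzero, so the same cast would
point into hw_state and whatever follows it, which is the corruption
described in the original question.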

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
"The worst sin towards our fellow creatures is not to hate them, but to be
 indifferent to them; that's the essence of inhumanity."
-- George Bernard Shaw






Patch for review: only report protocol data via EVFILT_READ filter

2002-10-16 Thread Kelly Yancey


  Currently, the value returned in a kevent's data member by the
EVFILT_READ filter is "number of bytes in the socket buffer" which
includes control and out-of-band data.  However, this isn't particularly
useful as any read(), readv(), or recvmsg() for the amount of data
reported may block if there is any non-protocol data in the buffer.  And
since there is no way for userland applications to determine whether, and
if so how much, non-protocol data is in the buffer, the reported value
cannot be trusted for anything useful.
  PR 30634 touches on this issue; UDP sockets are particularly visible
examples since they always include 16 bytes of address information in
addition to the datagram received.  However, from reading the code it
would appear that OOB data can cause a similar problem for TCP sockets.
  It seems that the overriding issue is that the read* API takes the
number of bytes of protocol data to read whereas kevent() reports the
total number of bytes available (protocol or administrative).  The
attached patch, which I would appreciate your comments on, modifies
kevent() to report just the number of bytes of protocol data.
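
  The accounting can be modeled in userland as a rough sketch.  The
struct and enum below are simplified stand-ins invented for
illustration (only the sb_cc and sb_ctl field names come from the
patch), and the example uses the 16 bytes of address information
mentioned above for a UDP datagram.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for mbuf types; the values are not the kernel's. */
enum model_mtype { MODEL_DATA, MODEL_SONAME, MODEL_CONTROL };

struct sockbuf_model {
	unsigned sb_cc;		/* all bytes in the buffer */
	unsigned sb_ctl;	/* bytes that are not protocol data */
};

/* Mirrors the sballoc() change: count non-data bytes separately. */
static void
sb_account(struct sockbuf_model *sb, enum model_mtype type, unsigned len)
{
	assert(sb != NULL);
	sb->sb_cc += len;
	if (type != MODEL_DATA)
		sb->sb_ctl += len;
}

/* What the patched filter would report in the kevent data member. */
static unsigned
sb_readable(const struct sockbuf_model *sb)
{
	assert(sb != NULL);
	return sb->sb_cc - sb->sb_ctl;
}
```

  Queue a 100-byte datagram with its 16-byte address and sb_cc is 116
while sb_readable() is 100, which is the number a subsequent read()
can actually satisfy.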

  As an aside, it appears that the FIONREAD ioctl (sys_socket.c:soo_ioctl)
and stat(2) on a socket (sys_socket.c:soo_stat) also return the total
number of bytes (protocol data or otherwise) in the socket buffer.
For similar reasons as described above, I suspect that these should
also be modified to return just the number of bytes of actual data.  Unless
someone knows of an explicit example otherwise, I don't think changing the
value reported via these interfaces would break any existing applications
as they are probably expecting the new behaviour anyway.

  Thanks,

  Kelly

  (P.S. I've already sent a version of this patch, made against -stable,
   to Jonathan, but I haven't heard anything from him in almost a week)

--
Kelly Yancey --  kbyanc@{posi.net,FreeBSD.org}


Index: sys/socketvar.h
===
RCS file: /home/ncvs/src/sys/sys/socketvar.h,v
retrieving revision 1.94
diff -u -p -r1.94 socketvar.h
--- sys/socketvar.h 17 Aug 2002 02:36:16 -  1.94
+++ sys/socketvar.h 16 Oct 2002 21:34:13 -
@@ -105,6 +105,7 @@ struct socket {
u_int   sb_hiwat;   /* max actual char count */
u_int   sb_mbcnt;   /* chars of mbufs used */
u_int   sb_mbmax;   /* max chars of mbufs to use */
+   u_int   sb_ctl; /* non-data chars in buffer */
int sb_lowat;   /* low water mark */
int sb_timeo;   /* timeout for read/write */
short   sb_flags;   /* flags, see below */
@@ -227,6 +228,8 @@ struct xsocket {
 /* adjust counters in sb reflecting allocation of m */
 #define sballoc(sb, m) { \
(sb)->sb_cc += (m)->m_len; \
+   if ((m)->m_type != MT_DATA) \
+   (sb)->sb_ctl += (m)->m_len; \
(sb)->sb_mbcnt += MSIZE; \
if ((m)->m_flags & M_EXT) \
(sb)->sb_mbcnt += (m)->m_ext.ext_size; \
@@ -235,6 +238,8 @@ struct xsocket {
 /* adjust counters in sb reflecting freeing of m */
 #define sbfree(sb, m) { \
(sb)->sb_cc -= (m)->m_len; \
+   if ((m)->m_type != MT_DATA) \
+   (sb)->sb_ctl -= (m)->m_len; \
(sb)->sb_mbcnt -= MSIZE; \
if ((m)->m_flags & M_EXT) \
(sb)->sb_mbcnt -= (m)->m_ext.ext_size; \
Index: kern/uipc_socket.c
===
RCS file: /home/ncvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.132
diff -u -p -r1.132 uipc_socket.c
--- kern/uipc_socket.c  5 Oct 2002 21:23:46 -   1.132
+++ kern/uipc_socket.c  16 Oct 2002 21:32:01 -
@@ -1785,6 +1785,7 @@ filt_soread(struct knote *kn, long hint)
struct socket *so = (struct socket *)kn->kn_fp->f_data;
 
kn->kn_data = so->so_rcv.sb_cc;
+   kn->kn_data -= so->so_rcv.sb_ctl;
if (so->so_state & SS_CANTRCVMORE) {
kn->kn_flags |= EV_EOF;
kn->kn_fflags = so->so_error;



Re: ENOBUFS

2002-10-18 Thread Kelly Yancey
On Fri, 18 Oct 2002, Petri Helenius wrote:

> >
> > just reading the source code, yes, it appears that the card has
> > support for delayed rx/tx interrupts -- see RIDV and TIDV definitions
> > and usage in sys/dev/em/* . I don't know in what units are the values
> > (28 and 128, respectively), but it does appear that tx interrupts are
> > delayed a bit more than rx interrupts.
> >
> The thing what is looking suspect is also the "small packet interrupt" feature
> which does not seem to get modified in the em driver but is on the defines.
>
> If that would be on by default, we'd probably see interrupts "too often"
> because it tries to optimize interrupts for good throughput on small number
> of TCP streams.
>

  Hmm.  Might that explain the abysmal performance of the em driver with
packets smaller than 333 bytes?

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
FreeBSD, The Power To Serve: http://www.freebsd.org/





Re: ENOBUFS

2002-10-18 Thread Kelly Yancey
On Fri, 18 Oct 2002, Luigi Rizzo wrote:

> On Fri, Oct 18, 2002 at 10:27:04AM -0700, Kelly Yancey wrote:
> ...
> >   Hmm.  Might that explain the abysmal performance of the em driver with
> > packets smaller than 333 bytes?
>
> what do you mean ? it works great for me. even on -current i
> can push out over 400kpps (64byte frames) on a 2.4GHz box.
>
>   luigi
>

  Using a SmartBit to push traffic across a 1.8GHz P4; 82543 chipset card
plugged into PCI-X bus:

FrameSize   TxFrames   RxFrames   LostFrames   Lost (%)
330         249984     129518     120466       48.19
331         249144     127726     121418       48.73
332         248472     140817     107655       43.33
333         247800     247800     0            0

  It has no trouble handling frames 333 bytes or larger.  But for any frame
332 bytes or smaller we consistently see ~50% packet loss.  This same machine
easily pushes ~100Mbps with the very same frame sizes using a bge card rather
than em.
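
  For anyone wanting to double-check the last column, the loss figures
follow directly from the Tx and Rx counts.  A small sketch (the function
name is made up) that reproduces them in hundredths of a percent:

```c
#include <assert.h>

/* Packet loss in hundredths of a percent, rounded to nearest. */
static unsigned long long
loss_centipercent(unsigned long long tx, unsigned long long rx)
{
	assert(tx > 0 && rx <= tx);
	return ((tx - rx) * 10000ULL + tx / 2) / tx;
}
```

  loss_centipercent(249984, 129518) gives 4819, i.e. 48.19%, matching
the 330-byte row; the 333-byte row gives 0.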

  I've gotten the same results with both em driver version 1.3.14 and 1.3.15
on both FreeBSD 4.5 and 4.7 (all 4 combinations, that is).

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
FreeBSD, The Power To Serve: http://www.freebsd.org/





Re: ENOBUFS

2002-10-18 Thread Kelly Yancey
On Fri, 18 Oct 2002, Luigi Rizzo wrote:

> How is the measurement done, does the box under test act as a router
> with the smartbit pushing traffic in and expecting it back ?
>
  The box has 2 interfaces, an fxp and an em (or bge).  The GigE interface is
configured with 7 VLANs.  The SmartBit produces X byte UDP datagrams that go
through a Foundry ServerIron switch for VLAN tagging and then to the GigE
interface (where they are untagged).  The box is acting as a router and all
traffic is directed out the fxp interface where it returns to the SmartBit.

> The numbers are strange, anyways.
>
> A frame of N bytes takes (N*8+160) nanoseconds on the wire, which
> for 330-byte frames should amount to 10^9/(330*8+160) ~= 357kpps,
> not the 249 or so you are seeing. Looks as if the times were 40% off.
>

  Yeah, I've never made too much sense of the actual numbers myself.  Our
resident SmartBit expert runs the tests and provides me with the results.  I
use them more for getting an idea of the relative performance of one
configuration over another and not as absolute numbers themselves.  I'll check
with our resident expert and see if he can explain how it calculates those
numbers.  The point being, though, that there is an undeniable drop-off with
332 byte or smaller packets.  We have never seen any such drop-off using the
bge driver.
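
  For reference, the wire-rate arithmetic Luigi quotes can be sketched as
below.  The 160 bits are the 8-byte preamble plus the 12-byte
inter-frame gap; the function name is made up, and the frame size is
assumed to already include the Ethernet header and CRC.

```c
#include <assert.h>

/* Preamble (8 bytes) plus inter-frame gap (12 bytes), in bits. */
#define ETH_OVERHEAD_BITS 160ULL

/* Maximum frames per second for a given frame size and link rate. */
static unsigned long long
max_pps(unsigned long long frame_bytes, unsigned long long link_bps)
{
	assert(frame_bytes > 0);
	return link_bps / (frame_bytes * 8 + ETH_OVERHEAD_BITS);
}
```

  max_pps(330, 1000000000ULL) comes out at 357142, the ~357kpps figure
for gigabit; the same frame on a 100Mbit/s link tops out at 35714.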

  Thanks,

  Kelly

>   cheers
>   luigi

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
Join distributed.net Team FreeBSD: http://www.posi.net/freebsd/Team-FreeBSD/





Re: ENOBUFS

2002-10-18 Thread Kelly Yancey
On Fri, 18 Oct 2002, Prafulla Deuskar wrote:

> FYI. 82543 doesn't support PCI-X protocol.
> For PCI-X support use 82544, 82545 or 82546 based cards.
>
> -Prafulla
>

  That is alright, we aren't expecting PCI-X speeds.  It is just that our only
PCI slot on the motherboard (1U rack-mount system) is a PCI-X slot.  Shouldn't
the 82543 still function normally, just at PCI speeds?

  Thanks,

  Kelly


--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
"No nation is permitted to live in ignorance with impunity."
-- Thomas Jefferson, 1821.





Re: ENOBUFS

2002-10-18 Thread Kelly Yancey
On Fri, 18 Oct 2002, Luigi Rizzo wrote:

> Oh, I *thought* the numbers you reported were pps but now i see that
> nowhere you mentioned that.
>

  Sorry.  I just checked with our tester.  Those are the total number of
packets sent during the test.  Each test lasted 10 seconds, so divide by 10 to
get pps.

> But if things are as you say, i am seriously puzzled on what you
> are trying to measure -- the output interface (fxp) is a 100Mbit/s
> card which cannot possibly support the load you are trying to offer
> to saturate the input link.
>

  We don't want to saturate the input link, only saturate the outbound link
(100Mbps).  Oddly enough, the em card cannot do this with any packets less than
333 bytes and drops ~50% of the packets.  But clearly this isn't a bottleneck
issue because the drop-off isn't smooth.  332 byte packets cause ~50% packet
loss; 333 byte packets cause 0% packet loss.

> You should definitely clarify how fast the smartbits unit is pushing
> out traffic, and whether its speed depends on the measured RTT.
>

  It doesn't sound like the box is that smart.  As it was explained to me, the
test setup includes a desired 'load' to put on the wire: it is measured as a
percentage of the wire speed.  Since our SmartBit unit only supports
100base-T and doesn't understand vlans, we have to use 7 separate outbound
ports, each configured for 14.25% load.  To the GigE interface, this should
appear as 99.75 megabits of data (including all headers/framing).
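
  The 99.75 figure is just the per-port load summed over the 7 ports.
As a sketch (the function name is invented, and the load is expressed
in hundredths of a percent to stay in integer arithmetic):

```c
#include <assert.h>

/* Aggregate offered load; per-port load is in hundredths of a percent. */
static unsigned long long
offered_bps(unsigned ports, unsigned load_centipercent,
    unsigned long long port_bps)
{
	assert(load_centipercent <= 10000);
	return ports * port_bps * load_centipercent / 10000ULL;
}
```

  offered_bps(7, 1425, 100000000ULL) is 99750000, i.e. 99.75Mbit/s,
just under what the 100Mbit/s outbound link can carry.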

> It might well be that what you are
> seeing is saturation of ipintrq, which happens because of some
> strange timing issue -- nothing to do with the board.
>

  I don't understand why it would only happen with the em card and not with
the bge under the exact same traffic (or even more demanding traffic, i.e.
64byte frames).  Also, wouldn't packet loss gradually subside as we approached
the 333 byte magic limit rather than the sudden drop-off we are seeing?

> In any case, at least in my experience, a 1GHz box with two em
> cards can easily forward between 350 and 400kpps (64-byte frames) with a
> 4.6-ish kernel, and a 2.4GHz box goes above 650kpps.
>

  We expect our kernel to be slower than that (we typically see ~120kpps for
64-byte frames using the bge driver and a 5701-based card) because we are
using an fxp card for outbound traffic and have added additional code to the
ip_input() processing.  The point isn't absolute numbers, though, but trying
to figure out why when using the em driver (and only with the em driver!) we
see ~50% packet loss with packets smaller than 333 bytes (no matter what size,
just that it is smaller).  That is, 64 byte frames: ~50% packet loss; 332 byte
frames: ~50% packet loss; 333 byte frames: 0% packet loss.  That sort of
sudden drop doesn't look like a bottleneck to me.
  We've mostly written the em driver off because of this.  The bge driver
works just fine performance wise; it was the sporadic watchdog timeouts
that led us to investigate the Intel cards to begin with.  I only mentioned it
on-list because earlier Jim McGrath alluded to similar performance issues with
the Intel GigE cards and small frames.

  Actually, at this point, I'm hoping that your polling patches for the em
driver work around whatever problem is causing the packet loss and am eagerly
awaiting them to be committed. :)

  Thanks,

  Kelly

>
> On Fri, Oct 18, 2002 at 11:13:54AM -0700, Kelly Yancey wrote:
> > On Fri, 18 Oct 2002, Luigi Rizzo wrote:
> >
> > > How is the measurement done, does the box under test act as a router
> > > with the smartbit pushing traffic in and expecting it back ?
> > >
> >   The box has 2 interfaces, a fxp and a em (or bge).  The GigE interface is
> > configured with 7 VLANs.  The SmartBit produces X byte UDP datagrams that go
> > through a Foundry ServerIron switch for VLAN tagging and then to the GigE
> > interface (where they are untagged).  The box is acting as a router and all
> > traffic is directed out the fxp interface where it returns to the SmartBit.
> >
> > > The numbers are strange, anyways.
> > >
> > > A frame of N bytes takes (N*8+160) nanoseconds on the wire, which
> > > for 330-byte frames should amount to 100/(330*8+160) ~= 357kpps,
> > > not the 249 or so you are seeing. Looks as if the times were 40% off.
> > >
> >
> >   Yeah, I've never made too much sense of the actual numbers myself.  Our
> > resident SmartBit expert runs the tests and provides me with the results.  I
> > use them more for getting an idea of the relative performance of one
> > configuration over another and not as absolute numbers themselves.  I'll check
> > with our resident expert and see if he can explain how it calculates those
&

Re: ENOBUFS

2002-10-18 Thread Kelly Yancey
On Fri, 18 Oct 2002, Kelly Yancey wrote:

> > You should definitely clarify how fast the smartbits unit is pushing
> > out traffic, and whether its speed depends on the measured RTT.
> >
>
>   It doesn't sound like the box is that smart.  As it was explained to me, the
> test setup includes a desired 'load' to put on the wire: it is measured as a
> percentage of the wire speed.  Since our SmartBit unit only supports
> 100base-T and doesn't understand vlans, we have to use 7 separate outbound
> ports, each configured for 14.25% load.  To the GigE interface, this should
> appear as 99.75 megabits of data (including all headers/framing).
>

  Oops.  That was actually the explanation of the SmartBits 'desired ILoad'
which I didn't quote in the posted numbers.  The actual number of packets
transmitted is based on RTT.  Sorry for the confusion,

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
"No nation is permitted to live in ignorance with impunity."
-- Thomas Jefferson, 1821.





Re: Is anyone can help this ?

2002-10-18 Thread Kelly Yancey
On Fri, 18 Oct 2002, Feng Li wrote:
>
> Hi, Friends
>
> Could anyone advise me how to configure the Ethernet
> Card on my PC with speed=100Mbps, duplex=Full parameters ?
>
> My PC is running FreeBSD 3.1-Release, the interface name
> is fxp0.
>
> The config example will be appreciated greatly !
>
> Thanks a lot in advance !
>

  I don't remember if it is the same in 3.1, but per `man ifconfig` and `man
fxp`:

ifconfig fxp0 media 100baseTX mediaopt full-duplex
ifconfig fxp0 media 100BaseT mediaopt full-duplex

  By the way, this sort of question belongs on -questions.  Thanks,

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
Join distributed.net Team FreeBSD: http://www.posi.net/freebsd/Team-FreeBSD/





Performance of em driver (Was: ENOBUFS)

2002-10-30 Thread Kelly Yancey
On Fri, 18 Oct 2002, Kelly Yancey wrote:

>   Hmm.  Might that explain the abysmal performance of the em driver with
> packets smaller than 333 bytes?
>
>   Kelly
>

  This is just a follow-up to report that thanks to Luigi and Prafulla we
were able to track down the cause of the problems I was seeing with the em
driver/hardware.  In our test environment we had left the IP packet queue
(net.inet.ip.intr_queue_maxlen) at its default value of 50 which, when using
the em card, was overflowing causing the dropped packets.  While it is
curious that it was not overflowing using the bge card, clearly 50 packets
is a restrictive maximum queue size for any decent amount of traffic.
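
  One way to picture how a 50-entry queue can lose about half the traffic is
the toy model below.  It is purely an assumption for illustration, not a
description of how the em driver actually feeds the IP input queue: packets
are taken to arrive in bursts, the queue is drained only between bursts, and
the burst size of 100 is a hypothetical value chosen to make the effect
visible.

```python
from collections import deque


class BoundedQueue:
    """Drop-tail queue: anything arriving while it is full is discarded."""

    def __init__(self, maxlen):
        self.maxlen = maxlen
        self.items = deque()
        self.drops = 0

    def enqueue(self, pkt):
        if len(self.items) >= self.maxlen:
            self.drops += 1
            return False
        self.items.append(pkt)
        return True

    def drain(self):
        """Hand everything queued to the consumer; return how many."""
        count = len(self.items)
        self.items.clear()
        return count


def simulate(bursts, burst_size, maxlen):
    """Return (delivered, dropped) when the queue drains between bursts."""
    queue = BoundedQueue(maxlen)
    delivered = 0
    for _ in range(bursts):
        for pkt in range(burst_size):
            queue.enqueue(pkt)
        delivered += queue.drain()
    return delivered, queue.drops


if __name__ == "__main__":
    for maxlen in (50, 1000):
        print("maxlen", maxlen, "->", simulate(10, 100, maxlen))
```

  In this model a queue limit of 50 delivers half of each 100-packet burst,
while a limit of 1000 delivers everything, regardless of packet size.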

  Below are some of the results from our testing.  First, a note about the
methodology: traffic was generated using 7 10/100 ethernet ports of a
SmartBits 600 (each port was set to generate 14.25Mbps of traffic for an
aggregate of 99.75Mbps, slightly higher than the theoretical maximum
wirespeed).  The traffic was then VLAN tagged before being passed to a
1.8GHz Pentium 4 running FreeBSD 4.5p19 where it was untagged and passed
back to the SmartBits.  The numbers quoted below are the actual amount of
traffic that was delivered back to the SmartBits.  The kernel involved
included a number of modifications proprietary to NTTMCL so the numbers are
going to differ from a stock kernel and I only present them for comparative
purposes between the different network configurations.  Also note that all
interfaces were configured for 100base-TX full-duplex.

                               Frame Size (bytes)
NICs        queue   ipfw       64       128       192
bge->fxp       50      0   79.708    97.325    98.124   Mbps
bge->fxp     1000      0   80.172    97.325    98.124   Mbps
em->fxp      1000      0   77.590    97.325    98.124   Mbps
bge->fxp       50     32   39.097    97.325    98.124   Mbps
bge->fxp     1000     32   62.011    97.325    98.124   Mbps
em->fxp      1000     32   63.651    97.325    98.124   Mbps

  The numbers in the ipfw column are the number of non-matching rules in the
ruleset before an "allow all from any to any" rule.
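
  To compare these throughput figures with packet rates, a conversion along
the following lines can be used.  It is a sketch that assumes the reported
Mbps values count the 20 bytes of preamble and inter-frame gap per frame;
that assumption is not stated in the test report, and the helper name is made
up for illustration.

```python
FRAMING_OVERHEAD_BYTES = 20  # assumed: preamble/SFD + inter-frame gap


def mbps_to_kpps(mbps, frame_bytes, overhead_bytes=FRAMING_OVERHEAD_BYTES):
    """Convert a wire-rate throughput figure to thousands of packets/sec."""
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return mbps * 1e6 / bits_per_frame / 1e3


# 64-byte column of the table above: (NICs, queue, ipfw rules, Mbps).
ROWS_64_BYTE = [
    ("bge->fxp", 50, 0, 79.708),
    ("bge->fxp", 1000, 0, 80.172),
    ("em->fxp", 1000, 0, 77.590),
    ("bge->fxp", 50, 32, 39.097),
    ("bge->fxp", 1000, 32, 62.011),
    ("em->fxp", 1000, 32, 63.651),
]

if __name__ == "__main__":
    for nics, queue, ipfw, mbps in ROWS_64_BYTE:
        kpps = mbps_to_kpps(mbps, 64)
        print(f"{nics:9s} queue={queue:4d} ipfw={ipfw:2d} {kpps:6.1f} kpps")
```

  Under that assumption the first row works out to roughly 118.6 kpps, which
is in the same neighborhood as the ~120kpps mentioned earlier in the thread
for the bge driver with 64-byte frames.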

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org} -- [EMAIL PROTECTED]
"And say, finally, whether peace is best preserved by giving energy to the
 government or information to the people.  This last is the most certain and
 the most legitimate engine of government."
-- Thomas Jefferson to James Madison, 1787.





Raw sockets and splnet()

2002-12-13 Thread Kelly Yancey

  Is there any particular reason that the raw socket implementation in
net/raw_usrreq.c does not require splnet() protection?  It seems as though
adding splnet()/splx() calls to the various raw_* routines would greatly
reduce the size of net/rtsock.c, in which many of the routines simply wrap
their raw_ counterparts with splnet()/splx().
  Currently, it appears that routing sockets are the only consumer of the raw
socket interface, but if another consumer were to exist then it would have to
do the same splnet()/splx() hackery, I imagine.  Wouldn't it make sense to
just put the logic into net/raw_usrreq.c and be done with it?

  Any insight would be appreciated.  Thanks,

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
Visit the BSD driver database: http://www.posi.net/freebsd/drivers/





Re: Raw sockets and splnet()

2002-12-13 Thread Kelly Yancey
On Fri, 13 Dec 2002, Kelly Yancey wrote:

>
>   Is there any particular reason that the raw socket implementation in
> net/raw_usrreq.c does not require splnet() protection?  It seems as though
> adding splnet()/splx() calls to the various raw_* routines would greatly
> reduce the size of net/rtsock.c, in which many of the routines simply wrap
> their raw_ counterparts with splnet()/splx().
>   Currently, it appears that routing sockets are the only consumer of the raw
> socket interface at the moment, but if another consumer were to exist then
> they would have to do the same splnet()/splx() hackery I imagine.  Wouldn't it
> make sense to just put the logic into net/raw_usrreq.c and be done with it?
>
>   Any insight would be appreciated.  Thanks,
>
>   Kelly
>

  Actually, as a follow-up to my own question, I don't see how the
splnet()/splx() calls in rtsock.c are necessary at all as all of the pru_*
hooks are called at splnet().  Being that rtsock's pru_* hooks are called at
splnet(), is there any reason not to just extern the various raw_* pru hooks
and reference them directly from route_usrreqs?

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
FreeBSD, The Power To Serve: http://www.freebsd.org/





Re: Raw sockets and splnet()

2002-12-13 Thread Kelly Yancey
On Fri, 13 Dec 2002, Kelly Yancey wrote:

>   Actually, as a follow-up to my own question, I don't see how the
> splnet()/splx() calls in rtsock.c are necessary at all as all of the pru_*
> hooks are called at splnet().  Being that rtsock's pru_* hooks are called at
> splnet(), is there any reason not to just extern the various raw_* pru hooks
> and reference them directly from route_usrreqs?
>
>   Kelly

  For a better idea of what I am talking about, a diff against 4.7 is
attached.  I've confirmed that it compiles and will leave a machine running
with this patch up over the weekend.  Any comments would be appreciated,

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
FreeBSD, The Power To Serve: http://www.freebsd.org/

Index: raw_cb.h
===
RCS file: /home/cvs/acs/base/src/sys/net/raw_cb.h,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 raw_cb.h
--- raw_cb.h	22 Mar 2002 04:11:00 -  1.1.1.1
+++ raw_cb.h	14 Dec 2002 04:17:55 -
@@ -71,6 +71,19 @@ void  raw_input __P((struct mbuf *,
struct sockproto *, struct sockaddr *, struct sockaddr *));
 
 extern struct pr_usrreqs raw_usrreqs;
+
+int raw_uabort __P((struct socket *));
+int raw_uattach __P((struct socket *, int, struct proc *));
+int raw_ubind __P((struct socket *, struct sockaddr *, struct proc *));
+int raw_uconnect __P((struct socket *, struct sockaddr *, struct proc *));
+int raw_udetach __P((struct socket *));
+int raw_udisconnect __P((struct socket *));
+int raw_upeeraddr __P((struct socket *, struct sockaddr **));
+int raw_usend __P((struct socket *, int, struct mbuf *, struct sockaddr *,
+   struct mbuf *, struct proc *));
+int raw_ushutdown __P((struct socket *));
+int raw_usockaddr __P((struct socket *, struct sockaddr **));
+
 #endif
 
 #endif
Index: raw_usrreq.c
===
RCS file: /home/cvs/acs/base/src/sys/net/raw_usrreq.c,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 raw_usrreq.c
--- raw_usrreq.c	22 Mar 2002 04:11:00 -  1.1.1.1
+++ raw_usrreq.c	14 Dec 2002 04:17:55 -
@@ -135,7 +135,7 @@ raw_ctlinput(cmd, arg, dummy)
/* INCOMPLETE */
 }
 
-static int
+int
 raw_uabort(struct socket *so)
 {
struct rawcb *rp = sotorawcb(so);
@@ -150,7 +150,7 @@ raw_uabort(struct socket *so)
 
 /* pru_accept is EOPNOTSUPP */
 
-static int
+int
 raw_uattach(struct socket *so, int proto, struct proc *p)
 {
struct rawcb *rp = sotorawcb(so);
@@ -163,13 +163,13 @@ raw_uattach(struct socket *so, int proto
return raw_attach(so, proto);
 }
 
-static int
+int
 raw_ubind(struct socket *so, struct sockaddr *nam, struct proc *p)
 {
return EINVAL;
 }
 
-static int
+int
 raw_uconnect(struct socket *so, struct sockaddr *nam, struct proc *p)
 {
return EINVAL;
@@ -178,7 +178,7 @@ raw_uconnect(struct socket *so, struct s
 /* pru_connect2 is EOPNOTSUPP */
 /* pru_control is EOPNOTSUPP */
 
-static int
+int
 raw_udetach(struct socket *so)
 {
struct rawcb *rp = sotorawcb(so);
@@ -190,7 +190,7 @@ raw_udetach(struct socket *so)
return 0;
 }
 
-static int
+int
 raw_udisconnect(struct socket *so)
 {
struct rawcb *rp = sotorawcb(so);
@@ -207,7 +207,7 @@ raw_udisconnect(struct socket *so)
 
 /* pru_listen is EOPNOTSUPP */
 
-static int
+int
 raw_upeeraddr(struct socket *so, struct sockaddr **nam)
 {
struct rawcb *rp = sotorawcb(so);
@@ -224,7 +224,7 @@ raw_upeeraddr(struct socket *so, struct 
 /* pru_rcvd is EOPNOTSUPP */
 /* pru_rcvoob is EOPNOTSUPP */
 
-static int
+int
 raw_usend(struct socket *so, int flags, struct mbuf *m,
  struct sockaddr *nam, struct mbuf *control, struct proc *p)
 {
@@ -267,7 +267,7 @@ release:
 
 /* pru_sense is null */
 
-static int
+int
 raw_ushutdown(struct socket *so)
 {
struct rawcb *rp = sotorawcb(so);
@@ -278,7 +278,7 @@ raw_ushutdown(struct socket *so)
return 0;
 }
 
-static int
+int
 raw_usockaddr(struct socket *so, struct sockaddr **nam)
 {
struct rawcb *rp = sotorawcb(so);
Index: rtsock.c
===
RCS file: /home/cvs/acs/base/src/sys/net/rtsock.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 rtsock.c
--- rtsock.c	23 Aug 2002 04:10:27 -  1.1.1.2
+++ rtsock.c	14 Dec 2002 04:17:55 -
@@ -88,15 +88,6 @@ static void   rt_setmetrics __P((u_long, 
  * It really doesn't make any sense at all for this code to share much
  * with raw_usrreq.c, since its functionality is so restricted.  XXX
  */
-static int
-rts_abort(struct socket *so)
-{
-   int s, error;
-   s = splnet();
-   error = raw_usrreqs.pru_abort(so)

Radix nodes, netmasks, and bogus sockaddrs, oh my!

2003-01-06 Thread Kelly Yancey

  Is there any reason to fix the code in the kernel which assumes
rt_mask(rt) is a properly-formed sockaddr?

  For example, sys/net/rtsock.c:sysctl_dumpentry() just passes
rt_mask(rt)'s contents to userland to be interpreted as a sockaddr, but it
is seldom a properly-formed sockaddr (i.e. sa_family is almost always
garbage and sa_len is 0 for the default route).

  Nothing in the base system appears to care that the netmask isn't a
full-fledged sockaddr so it isn't hurting anything.  The main reason I ask
is that interfaces such as sysctl_rtable and routing sockets are currently
making stronger claims than they are living up to and I would be inclined
to fix it.  But if it were to be fixed, is there a preference for whether
it should be corrected in the routing table itself or just when the
information is exported?
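
  To make the problem concrete, here is a rough sketch of how a userland
consumer might defensively decode such a netmask.  It assumes the BSD
sockaddr_in layout (one byte of length, one byte of family, two bytes of
port, then the address) and that trailing zero bytes of the mask may simply
be missing; the function name is hypothetical and the family byte is
deliberately ignored since it cannot be trusted.

```python
import socket

SIN_ADDR_OFFSET = 4  # assumed layout: sin_len, sin_family, sin_port[2]
SIN_ADDR_END = 8     # sin_addr occupies bytes 4..7


def netmask_from_truncated_sockaddr(raw):
    """Recover a dotted-quad netmask from a possibly truncated sockaddr_in.

    sa_len may be anything from 0 (default route) up to the full structure
    size, so only the address bytes that are actually present are used and
    the rest are treated as zero.
    """
    sa_len = raw[0] if raw else 0
    if sa_len > SIN_ADDR_OFFSET:
        addr = raw[SIN_ADDR_OFFSET:min(sa_len, SIN_ADDR_END)]
    else:
        addr = b""
    return socket.inet_ntoa(addr.ljust(4, b"\x00"))


if __name__ == "__main__":
    # sa_len = 7: only three address bytes are present.
    print(netmask_from_truncated_sockaddr(bytes([7, 0, 0, 0, 255, 255, 255])))
    # sa_len = 0: the default route's mask.
    print(netmask_from_truncated_sockaddr(bytes([0])))
```

  A consumer written this way never looks at sa_family and never reads past
sa_len, which is about as much as the exported data can be relied on for.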

  Thanks,

  Kelly





Re: misc/44361: possible raw socket bug

2003-01-18 Thread Kelly Yancey
On Sat, 18 Jan 2003, Alfred Perlstein wrote:

> It appears that we expect the ip_len and ip_off fields to be sent
> in host byte order as the stack will fix it to network byte order
> in ip_output.
>
> Is this a bug or feature? :)
>
> --
> -Alfred Perlstein [[EMAIL PROTECTED]]

  Both, no? :)  It's a bug documented in Stevens' TCP/IP Illustrated, Volume 2
as being around since 4.4BSD, but I would expect that fixing it would break a
good bit.  On the other hand, it is supposedly fixed in OpenBSD.
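
  For anyone unsure what the quirk means in practice, the difference between
the two encodings can be sketched as below.  This only illustrates the byte
layout of the two 16-bit fields using Python's struct module; the function
name and the flag are made up, and it does not open a raw socket.

```python
import struct
import sys


def pack_len_off(ip_len, ip_off, host_order):
    """Pack ip_len and ip_off as two 16-bit fields.

    host_order=True  -> native byte order, as the stack described above
                        is said to expect from raw-socket senders.
    host_order=False -> network (big-endian) order, as seen on the wire.
    """
    fmt = "=HH" if host_order else "!HH"
    return struct.pack(fmt, ip_len, ip_off)


if __name__ == "__main__":
    print("host byte order is", sys.byteorder)
    print("host order   :", pack_len_off(1500, 0, True).hex())
    print("network order:", pack_len_off(1500, 0, False).hex())
```

  On a big-endian machine the two encodings are identical, which is one
reason the quirk tends to go unnoticed until code is moved to a
little-endian host.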

  Kelly

--
Kelly Yancey -- kbyanc@{posi.net,FreeBSD.org}
"The fact that a believer is happier than a skeptic is no more to the point
 than the fact that a drunken man is happier than a sober one."
-- George Bernard Shaw





1168 octets payload and bad TCP checksums

2004-01-02 Thread Kelly Yancey

  We've got Broadcom BCM5701 cards configured for vlan tagging on a
FreeBSD 4.7 based router; a vlan switch then terminates the trunked
segment and splits it into separate physical subnets.  It turns out that
hosts on those segments cannot receive TCP packets with precisely 1168
octets of payload (ethernet frame size 1222 octets) as the checksum is
always incorrect.  We've manually backported all of the bge driver updates
from 4-stable, but to no avail.
  What is particularly odd is that the checksums are always wrong by the
same amount: 0xAC48 (the dump below only shows retries of the same
packet, but the difference is the same even for other packets).
Furthermore, it appears only TCP packets with 1168 octets of data are
affected.  I cannot easily create an environment without the vlans to
determine whether or not tagging is related.  Note also, that the IP
checksum is correct.
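
  For reference, the checksum in question is the usual 16-bit one's
complement sum.  A minimal sketch of it follows; it covers only the generic
algorithm, not the TCP pseudo-header, and the function name is illustrative.

```python
def internet_checksum(data):
    """16-bit one's complement checksum over a byte string."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    sample = bytes.fromhex("0001f203f4f5f6f7")
    csum = internet_checksum(sample)
    print(hex(csum))
    # A receiver sums the data together with the transmitted checksum;
    # an intact packet yields zero.
    print(internet_checksum(sample + csum.to_bytes(2, "big")))
```

  Because the sum is a plain function of the bytes, retransmissions of the
same segment that are damaged in the same way will show the same bad
checksum every time, which fits the repeated value in the dump below.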

  Has anyone else experienced similar problems?  Does anyone have a clue
where to begin to track down the problem?  Currently I'm looking at the
tcp checksum calculation (tcp_fillheaders), but I don't really see how
that could be the culprit as if such a bug existed there, it would affect
all interfaces and surely would have been noticed by now.  At the same
time, I don't see anywhere else offhand the problem could be.  Again, if
anyone has any advice, I would greatly appreciate it.  Thanks,

  Kelly


bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1504
        options=3
        inet 10.30.3.254 netmask 0xfff8 broadcast 10.30.3.255
        ether 00:00:5e:00:01:4b
        media: Ethernet autoselect (100baseTX )
        status: active
vlan9: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet 10.30.3.1 netmask 0xfff8 broadcast 10.30.3.7
        ether 00:00:5e:00:01:4b
        vlan: 9 parent interface: bge0
vlan10: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet 10.30.3.9 netmask 0xfff8 broadcast 10.30.3.15
        ether 00:00:5e:00:01:4b
        vlan: 10 parent interface: bge0

  Extract from tcpdump -vvv taken on host 216.69.90.56 connected to
FreeBSD router via vlan10 interface:

11:38:55.665425 216.69.68.198.22 > 216.69.90.56.3335: . [tcp sum ok] 561:2021(1460) 
ack 432 win 14352 (DF) [tos 0x10]  (ttl 59, id 57881, len 1500)
11:38:55.666782 216.69.68.198.22 > 216.69.90.56.3335: P [tcp sum ok] 2021:2049(28) ack 
432 win 14352 (DF) [tos 0x10]  (ttl 59, id 57882, len 68)
11:38:55.666839 216.69.90.56.3335 > 216.69.68.198.22: . [tcp sum ok] 432:432(0) ack 
2049 win 17520 (DF) (ttl 128, id 57057, len 40)
11:38:55.668899 216.69.68.198.22 > 216.69.90.56.3335: P [bad tcp cksum 1de3!] 
2049:3217(1168) ack 432 win 14352 (DF) [tos 0x10]  (ttl 59, id 57883, len 1208)
11:38:55.920110 216.69.68.198.22 > 216.69.90.56.3335: P [bad tcp cksum 1de3!] 
2049:3217(1168) ack 432 win 14352 (DF) [tos 0x10]  (ttl 59, id 57884, len 1208)
11:38:56.419788 216.69.68.198.22 > 216.69.90.56.3335: P [bad tcp cksum 1de3!] 
2049:3217(1168) ack 432 win 14352 (DF) [tos 0x10]  (ttl 59, id 57885, len 1208)
11:38:56.442824 216.69.224.134 > 216.69.90.56: icmp: echo request (ttl 108, id 24195, 
len 92)
11:38:57.419622 216.69.68.198.22 > 216.69.90.56.3335: P [bad tcp cksum 1de3!] 
2049:3217(1168) ack 432 win 14352 (DF) [tos 0x10]  (ttl 59, id 57886, len 1208)
11:38:58.098535 216.69.90.56.3337 > 216.69.68.197.53: [udp sum ok]  12575+ PTR? 
56.90.69.216.in-addr.arpa. (43) (ttl 128, id 57060, len 71)
11:38:58.098868 216.69.90.56.3337 > 216.69.68.197.53: [udp sum ok]  12576+ PTR? 
1.90.69.216.in-addr.arpa. (42) (ttl 128, id 57061, len 70)
11:38:58.102453 216.69.68.197.53 > 216.69.90.56.3337: [udp sum ok]  12575 NXDomain* q: 
PTR? 56.90.69.216.in-addr.arpa. 0/1/0 ns: 90.69.216.in-addr.arpa. SOA ns.nttmcl.com. 
hostmaster.nttmcl.com. 2002111000 7200 3600 1209600 432000 (103) (ttl 59, id 43147, 
len 131)
11:38:58.103689 216.69.68.197.53 > 216.69.90.56.3337: [udp sum ok]  12576 NXDomain* q: 
PTR? 1.90.69.216.in-addr.arpa. 0/1/0 ns: 90.69.216.in-addr.arpa. SOA ns.nttmcl.com. 
hostmaster.nttmcl.com. 2002111000 7200 3600 1209600 432000 (102) (ttl 59, id 63562, 
len 130)
11:38:59.419902 216.69.68.198.22 > 216.69.90.56.3335: P [bad tcp cksum 1de3!] 
2049:3217(1168) ack 432 win 14352 (DF) [tos 0x10]  (ttl 59, id 57887, len 1208)
11:39:03.419776 216.69.68.198.22 > 216.69.90.56.3335: P [bad tcp cksum 1de3!] 
2049:3217(1168) ack 432 win 14352 (DF) [tos 0x10]  (ttl 59, id 57888, len 1208)
11:39:06.305954 216.69.90.56.3335 > 216.69.68.198.22: P [tcp sum ok] 432:480(48) ack 
2049 win 17520 (DF) (ttl 128, id 57062, len 88)
11:39:06.344820 216.69.68.198.22 > 216.69.90.56.3335: . [tcp sum ok] 3217:3217(0) ack 
480 win 14352 (DF) [tos 0x10]  (ttl 59, id 57889, len 40)
11:39:07.031807 216.69.90.56.3335 > 216.69.68.198.22: P [tcp sum ok] 480:528(48) ack 
2049 win 17520 (DF) (ttl 128, id 57065, len 88)
11:39:07.035322 216.69.68.198.22 > 216.69.90.56.3335: . [tcp sum ok] 3217:3217(0) ack 
528 win 14352 (DF) [tos 0x10]  (ttl 59, id 57890, len 40)




Re: bge data corruption bug (was: 1168 octets payload and bad TCP checksums)

2004-01-16 Thread Kelly Yancey

On Tue, 13 Jan 2004, Kelly Yancey wrote:

>
> On Fri, 2 Jan 2004, Kelly Yancey wrote:
>
> >
> >   We've got Broadcom BCM5701 cards configured for vlan tagging on a
> > FreeBSD 4.7 based router; a vlan switch then terminates the trunked
> > segment and splits it into separate physical subnets.  It turns out that
> > hosts on those segments cannot receive TCP packets with precisely 1168
> > octets of payload (ethernet frame size 1222 octets) as the checksum is
> > always incorrect.  We've manually backported all of the bge driver updates
> > from 4-stable, but to no avail.
> >   What is particularly odd is that the checksums are always wrong by the
> > same amount: 0xAC48 (the dump below only shows retries of the same
> > packet, but the difference is the same even for other packets).
> > Furthermore, it appears only TCP packets with 1168 octets of data are
> > affected.  I cannot easily create an environment without the vlans to
> > determine whether or not tagging is related.  Note also, that the IP
> > checksum is correct.
> >
>
>   First, one slight clarification to my original posting: the received
> frame, after vlan untagging, is 1222 octets; the sent frame includes a tag
> so it is 1226 octets.
>
>   Anyway, it appears that the cause of the bad checksums is that the last
> dword of the transmitted frame is getting corrupted in hardware.
>
[ .. snip .. ]
>   So far, we have only been able to reproduce the problem with TCP packets
> with 1168 octets of payload, using vlan tagging on the bge interface.
[ .. snip .. ]

  Final update, just for the record: it turns out that, after adjusting
for the difference in header sizes, the bug is easily reproducible using
ping with 1177 to 1180 bytes of payload.  So, it isn't just TCP, and it
isn't just 1222 byte (1226 with vlan tag) ethernet frames.  It is a
definite 4-byte window of 1219 to 1222 byte packets.  Furthermore, the
corruption is caused by the hardware apparently copying the dword 3rd from
the end of the packet into the last dword of the frame.  You can see this
in the dumps in my previous posting, but using ping makes the problem
really stand out.  For example, the server sends an ICMP echo request which
ends with:

# tcpdump -Xx -s 4000 -pni vlan9 icmp
[ snip ]
0x04a0   8485 8687 8889 8a8b 8c8d 8e8f 9091 9293
0x04b0   9495 9697 9899 9a9b

  Then the client receives:
# tcpdump -Xx -s 4000 -pni an0 icmp
[ snip ]
0x04a0   8485 8687 8889 8a8b 8c8d 8e8f 9091 9293
0x04b0   9495 9697 9091 9293

  I've verified this with different clients, running both FreeBSD and
Windows, and using different NICs on the client side.  Swapping out the
bge interface for one supported by the sk or em driver solves the problem.
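
  The frame-size window and the corruption pattern can be checked with a few
lines of Python.  This is a sketch under the usual assumptions (14-byte
Ethernet header, 20-byte IP header without options, 20-byte TCP and 8-byte
ICMP headers, 4-byte vlan tag, FCS not counted); the function names are made
up, and the corruption routine merely mimics what the dumps above show
rather than explaining why the hardware does it.

```python
ETH_HDR, IP_HDR, TCP_HDR, ICMP_HDR, VLAN_TAG = 14, 20, 20, 8, 4


def frame_len(payload, l4_hdr, tagged=False):
    """Ethernet frame length (without FCS) for a given payload size."""
    return ETH_HDR + (VLAN_TAG if tagged else 0) + IP_HDR + l4_hdr + payload


def corrupt_like_dump(frame):
    """Overwrite the last dword with the dword third from the end."""
    if len(frame) < 12:
        return frame
    return frame[:-4] + frame[-12:-8]


if __name__ == "__main__":
    print("TCP, 1168 bytes:", frame_len(1168, TCP_HDR),
          "untagged,", frame_len(1168, TCP_HDR, tagged=True), "tagged")
    print("ICMP window:", frame_len(1177, ICMP_HDR), "to",
          frame_len(1180, ICMP_HDR))
    # Tail of the ping payload as sent (bytes 0x84 through 0x9b).
    tail = bytes(range(0x84, 0x9C))
    print("sent    :", tail[-8:].hex())
    print("received:", corrupt_like_dump(tail)[-8:].hex())
```

  The sizes come out as 1222 and 1226 for the TCP case and 1219 to 1222 for
the ping payloads, and the simulated tail matches the received dump.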

  The workaround that we have found for the bge interface is to simply set
the LINK0 flag on the vlan interfaces.  I guess something about letting
the hardware add the vlan tag keeps it from mangling our packets.  Which
means that this bug only affects -stable as sam's 1.44 delta avoids the
issue on FreeBSD 5.0 and higher.  In any event, we have our solution; if
anyone else out there is using a bge card as a vlan parent interface on a
4.x box, consider yourself warned: enable LINK0 or face seemingly random
data corruption.

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]



Re: truncated-ip problem

2004-01-27 Thread Kelly Yancey
On Tue, 27 Jan 2004, Kai Mosebach wrote:

> Dear lists,
>
> lately i installed my netgear wg511 pci, trying to run it as a AP on my
> FreeBSD 5.2-RELEASE using the ath driver.
>
> Now I run into these problems :
>
> dhcp requests are not answered, pings don't work (traffic at all is
> unstable)
>
> a tcpdump results in this :
>
> -bash-2.05b# tcpdump -e -vvv -i ath0
> tcpdump: listening on ath0
> 17:42:45.311390 0:9:5b:84:56:7f Broadcast ip 342: 0.0.0.0.bootpc >
> 255.255.255.255.bootps:  xid:0x1d24ed9c [|bootp] (ttl 128, id 1547, len 328)
> 17:42:45.337508 0:9:5b:84:56:7f Broadcast ip 342: truncated-ip - 18105 bytes
> missing! 0.0.0.0.bootpc > 255.255.255.255.bootps:  xid:0x1d24ed9c [|bootp]
> (ttl 128, id 1547, len 18433, bad cksum 339b!)

  Try adding a larger snapshot length to your tcpdump command-line with -s
(for example, -s 1514 to capture whole Ethernet frames).  This wouldn't be the
cause of your connectivity problems, but would reduce the noise in your
tcpdumps.  Tcpdump cannot calculate the checksums you requested by specifying
-vvv unless it has the entire packet to work with.
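
  As a rough illustration of the numbers in that output, the sketch below
shows the two length comparisons involved.  It assumes a 14-byte link header
and that the "bytes missing" figure is the IP total length minus the IP
bytes available in the frame, which is consistent with the values quoted
above; both helper names are hypothetical.

```python
ETHER_HDR_LEN = 14  # assumed link-layer header size


def ip_bytes_missing(frame_len, ip_total_len, link_hdr_len=ETHER_HDR_LEN):
    """How many bytes the IP header claims beyond what the frame holds."""
    available = frame_len - link_hdr_len
    return max(0, ip_total_len - available)


def checksum_verifiable(captured_ip_bytes, ip_total_len):
    """A checksum can only be recomputed if the whole packet was captured."""
    return captured_ip_bytes >= ip_total_len


if __name__ == "__main__":
    # First packet in the quoted dump: 342-byte frame, IP length 328.
    print(ip_bytes_missing(342, 328))
    # Second packet: same frame size, but the IP header claims 18433 bytes.
    print(ip_bytes_missing(342, 18433))
    # With a short snapshot length only part of the packet is available.
    print(checksum_verifiable(54, 328), checksum_verifiable(328, 328))
```

  The second packet's claimed length accounts exactly for the 18105 bytes
reported missing.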

  Kelly
--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]


Re: Intel 82541 Gigabit Ethernet

2004-07-05 Thread Kelly Yancey
On Mon, 5 Jul 2004, Karim Fodil-Lemelin wrote:

> I was going through the hardware notes. We already use Intel's NIC's and
> are planning to switch to Dual Intel® 82541 Gigabit Ethernet. It's not
> listed in the hardware notes but we use the 82555 on FBSD4.8 (fxp)
> driver and it works fine (and it is not listed either). Is anyone
> working with it or have tried it (the 82541)? What is the status? any
> plans for a driver that would support those? Also, will the fxp driver
> work  (enabling functionality at 100Mbps only) but a
> patch/another_driver is required to get the gigabit?
>

  It is listed in the README for the em driver:

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/em/README?rev=1.1.2.8&content-type=text/x-cvsweb-markup

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
Join distributed.net Team FreeBSD: http://www.posi.net/freebsd/Team-FreeBSD/


RE: device polling takes more CPU hits??

2004-07-26 Thread Kelly Yancey
On Mon, 26 Jul 2004, Don Bowman wrote:

> kern.polling.burst: 1000
> kern.polling.each_burst: 80
> kern.polling.burst_max: 1000
> kern.polling.idle_poll: 1
> kern.polling.poll_in_trap: 0
> kern.polling.user_frac: 5
> kern.polling.reg_frac: 120
> kern.polling.short_ticks: 29
> kern.polling.lost_polls: 55004
> kern.polling.pending_polls: 0
> kern.polling.residual_burst: 0
> kern.polling.handlers: 4
> kern.polling.enable: 1
> kern.polling.phase: 0
> kern.polling.suspect: 50690
> kern.polling.stalled: 25

  Out of curiosity, what sort of testing did you do to arrive at these
settings?  I did some testing a while back with a SmartBits box pumping
packets through a FreeBSD 2.8GHz box configured to route between two em
gigabit interfaces; I found that changing the burst_max and each_burst
parameters had almost no effect on throughput (maximum 1% difference).
That was completely contrary to expectations, and I would love to hear how I
could improve my test setup to see how changing those values is supposed
to affect performance.

  Thanks,

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
"The information of the people at large can alone make them the safe as they
 are the sole depositary of our political and religious freedom."
-- Thomas Jefferson to William Duane, 1810. ME 12:417


Re: device polling takes more CPU hits??

2004-07-26 Thread Kelly Yancey
On Mon, 26 Jul 2004, Luigi Rizzo wrote:

> On Mon, Jul 26, 2004 at 01:18:46PM -0700, Kelly Yancey wrote:
> ...
> >   Out of curiosity, what sort of testing did you do to arrive at these
> > settings?  I did some testing a while back with a SmartBits box pumping
> > packets through a FreeBSD 2.8GHz box configured to route between two em
> > gigabit interfaces; I found that changing the burst_max and each_burst
> > parameters had almost no effect on throughput (maximum 1% difference).
>
> fast boxes are pci-bus limited, not CPU limited(*) so changing the burst
> size (which basically amortizes some CPU costs) has little if any
> effect.
>
> (*) this doesn't mean that the box cannot livelock, as depending on
> the traffic on the bus, the CPU might stall for long intervals
> waiting for bus transactions to complete, and becomes unable to
> do anything at all. So you might still need polling.
>
>   cheers
>   luigi
>

  Oh, I found polling to be vastly superior to interrupts under load on
the test machine.  Not only did it avoid livelock, the throughput was
about 10Mbps higher for small (64-byte) frames.  I just didn't find much
difference whether I used small burst sizes versus large burst sizes.  It
may have had to do with the fact that both the sending and receiving
interfaces were gigabit em cards and were polling (no interrupts from the
NICs at all).
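
  Luigi's explanation can be put into a toy model like the one below.  Every
number in it is hypothetical (per-poll overhead, per-packet cost, and the
bus ceiling were picked only to show the shape of the effect), and the
function name is made up; it is not a measurement of any real system.

```python
def forwarding_rate_kpps(burst, per_poll_cost_us, per_packet_cost_us,
                         bus_limit_kpps):
    """Forwarding rate when a fixed per-poll cost is spread over a burst.

    The CPU-bound rate improves with larger bursts, but the result is
    capped by whatever the bus can carry.
    """
    cpu_us_per_packet = per_packet_cost_us + per_poll_cost_us / burst
    cpu_limit_kpps = 1e3 / cpu_us_per_packet
    return min(cpu_limit_kpps, bus_limit_kpps)


if __name__ == "__main__":
    for bus_limit in (400.0, 2000.0):
        for burst in (5, 150):
            rate = forwarding_rate_kpps(burst, 5.0, 1.0, bus_limit)
            print(f"bus limit {bus_limit:6.0f} kpps, burst {burst:3d}: "
                  f"{rate:7.1f} kpps")
```

  With the lower bus ceiling both burst sizes give the same rate, which is
the behaviour I saw; only when the ceiling is raised well above the CPU-bound
rate does the burst size start to matter.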

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
Join distributed.net Team FreeBSD: http://www.posi.net/freebsd/Team-FreeBSD/


RE: device polling takes more CPU hits??

2004-07-26 Thread Kelly Yancey
On Mon, 26 Jul 2004, Don Bowman wrote:

> From: Luigi Rizzo [mailto:[EMAIL PROTECTED]
> > On Mon, Jul 26, 2004 at 01:18:46PM -0700, Kelly Yancey wrote:
> > ...
> > >   Out of curiosity, what sort of testing did you do to arrive at these
> > > settings?  I did some testing a while back with a SmartBits box pumping
> > > packets through a FreeBSD 2.8GHz box configured to route between two em
> > > gigabit interfaces; I found that changing the burst_max and each_burst
> > > parameters had almost no effect on throughput (maximum 1% difference).
> >
> > fast boxes are pci-bus limited, not CPU limited(*) so changing the burst
> > size (which basically amortizes some CPU costs) has little if any
> > effect.
>
> The PCI-X bus will probably be 64-bit 133MHz in this case,
> the limit moves up to the P64H2 hub for large packets,
> to the CPU for small packets. Polling becomes quite
> critical to prevent livelock.
>

  Sorry, I should have been clearer.  Polling certainly stopped livelock
under extreme load; however, I never found much difference whether the
burst size was small or large.  I was wondering if it was just the nature
of my test and if in other environments the burst_max and each_burst knobs
have a greater effect.

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
FreeBSD, The Power To Serve: http://www.freebsd.org/


Re: per-interface packet filters [summary]

2004-12-15 Thread Kelly Yancey
On Tue, 14 Dec 2004, Gleb Smirnoff wrote:

> On Tue, Dec 14, 2004 at 01:47:35PM +0100, Andre Oppermann wrote:
> A> > Implementationwise, the kernel side is evidently trivial as the
> A> > original code already supports the idea of multiple chains.  All
> A> > you need is to extend the struct ifnet with a pointer to the chain,
> A> > or use some other trick (e.g. going through ifindex) to quickly
> A> > associate a chain to the input (and possibly output) interface.
> A>
> A> Nonononononononononononononononononononononono.
> A>
> A> There MUST NOT be any firewall specific pointers or other information
> A> in struct ifnet or any other non-firewall private part of the kernel.
> A> Otherwise the entire independence we've gained with the nice and clean
> A> PFIL_HOOKS API goes down the drain.  This MUST NOT happen again.
> A>
> A> The whole idea of the PFIL_HOOKS is to have independend and loadable
> A> firewall modules with different approaches, internal designs and so
> A> on.
>
> The whole idea of PFIL_HOOKS is to have independend and loadable firewall
> modules, which can be attached to different parts of kernel! There is no
> such requirement that, pfil hooks MUST be sticked to a single entry point
> in ip_input() and ip_output().
>
> Pfils attached to interface belong to interface, and thus should be stored
> in struct ifnet. This is the way it is done in per-interface filters.
>
> A> For example a way Gleb can get his way without any bickering from us
> A> is by creating his own gleb-firewall module using the PFIL_HOOKS API
> A> and put it into the ports tree for easy access, provided he doesn't
> A> modify the PFIL_HOOKS API (which he doesn't have to).
>
> I am not going to create a new firewall or change PFIL_HOOKS. I'm going
> to attach *the existing* pfil_hooks to a different place, to perform
> filtering with *existing* firewalls.

  How about a generic per-interface pfil demultiplexer?  That is, a module
that uses the existing pfil hooks to in turn call per-interface hooks.
As Luigi suggested earlier, it would be possible to use the interface
index to index an array private to the demultiplexer's implementation.
If each element in this array had its own pfil_head, then the demultiplexer
could call pfil_run_hooks() using that list.  This would allow you
to have your per-interface hooks in a generic way without changing a line
of existing code.  It could be entirely encapsulated in a kld.  Given an
API to manipulate the per-interface pfil registration, you could even run
different filters on different interfaces.
  You'd even have a chance of back-porting it to FreeBSD 5.x since you
won't be changing the ifnet structure at all.
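
To make the idea concrete, here is a minimal userland model of such a
demultiplexer in C.  It is only a sketch: struct demux_head stands in for
a per-interface pfil_head, the loop in demux_run() stands in for
pfil_run_hooks(), and every name and limit below is hypothetical rather
than part of the pfil API.

```c
#include <assert.h>
#include <stddef.h>

#define DEMUX_MAX_IF    32      /* hypothetical table size */
#define DEMUX_MAX_HOOKS 4       /* hypothetical per-interface limit */

/* A filter returns 0 to pass the packet, nonzero to drop it. */
typedef int (*demux_hook_t)(const void *pkt, size_t len, int ifindex);

/* Stand-in for a per-interface pfil_head: an ordered list of hooks. */
struct demux_head {
    demux_hook_t hooks[DEMUX_MAX_HOOKS];
    int          nhooks;
};

/* Private to the demultiplexer and indexed by interface index. */
static struct demux_head demux_table[DEMUX_MAX_IF];

/* Attach a filter to one interface; 0 on success, -1 if full or invalid. */
int
demux_add_hook(int ifindex, demux_hook_t fn)
{
    if (ifindex < 0 || ifindex >= DEMUX_MAX_IF || fn == NULL)
        return -1;
    struct demux_head *h = &demux_table[ifindex];
    if (h->nhooks == DEMUX_MAX_HOOKS)
        return -1;
    h->hooks[h->nhooks++] = fn;
    return 0;
}

/*
 * The one hook the demultiplexer would register globally: look up the
 * interface and run only the filters attached to it.
 */
int
demux_run(const void *pkt, size_t len, int ifindex)
{
    if (ifindex < 0 || ifindex >= DEMUX_MAX_IF)
        return 0;               /* unknown interface: pass */
    const struct demux_head *h = &demux_table[ifindex];
    assert(h->nhooks <= DEMUX_MAX_HOOKS);
    for (int i = 0; i < h->nhooks; i++) {
        int rv = h->hooks[i](pkt, len, ifindex);
        if (rv != 0)
            return rv;          /* first filter to object decides */
    }
    return 0;
}

/* Two trivial example filters. */
int
filter_pass(const void *pkt, size_t len, int ifindex)
{
    (void)pkt; (void)len; (void)ifindex;
    return 0;
}

int
filter_drop(const void *pkt, size_t len, int ifindex)
{
    (void)pkt; (void)len; (void)ifindex;
    return 1;
}
```

Attaching filter_pass to interface 1 and filter_drop to interface 2 gives
different behavior per interface, while an interface with nothing
attached passes everything.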

  Just a thought,

  Kelly

-- 
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
Join distributed.net Team FreeBSD: http://www.posi.net/freebsd/Team-FreeBSD/


Re: per-interface packet filters [summary]

2004-12-15 Thread Kelly Yancey
On Thu, 16 Dec 2004, Andre Oppermann wrote:

> Kelly Yancey wrote:
> >
> >   How about a generic per-interface pfil demultiplexer?  That is, a module
> > that uses the existing pfil hooks to in turn call per-interface hooks.
> > As Luigi suggested earlier, it would be possible to use the interface
> > index to index an array private to the multiplexer's implementation.
> > If each element in this array had its own pfil_head, then the demultiplexer
> > could then call pfil_run_hooks() using that list.  This would allow you
> > to have your per-interface hooks in a generic way without changing a line
> > of existing code.  It could be entirely encapsulated in kld.  Provided an
> > API to manipulate the per-interface pfil registration, you could even run
> > different filters on different interfaces.
> >   You'de even have a chance of back-porting it to FreeBSD 5.x since you
> > won't be changing the ifnet structure at all.
>
> You'd have to change all firewall packages too.  Currently they are not
> aware of and can't deal with multiple rule chain heads.  The is the
> second main problem of Gleb implementation proposal so far.
>
> Nothing prevents generic routines to have the demultiplexer you describe
> but it's use and handling has to be inside each firewall package.
>

  Absolutely.  You could only use such a demultiplexer to select which
interfaces filters would apply to.  The issue of implementing different
behavior depending on the interface (e.g. a firewall implementing
per-interface rulesets) is necessarily a matter for the filter not the
framework.
  That said, since we have 3 firewall implementations, you could use the
demultiplexer to have 3 different sets of rules, each applied to a different
subset of the interfaces. :)

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
"An enlightened people, and an energetic public opinion... will control and
 enchain the aristocratic spirit of the government." --Thomas Jefferson


Patch to set TCP_NOPUSH on libfetch HTTP connections

2005-02-11 Thread Kelly Yancey
...
0x0030   0538 ae0f  .8..
13:19:42.871415 216.69.64.149.80 > 216.69.71.45.1390: . ack 109 win 17376 (DF)
0x0000   4500 0034 069a 4000 3c06 ffdc d845 4095
0x0010   d845 472d 0050 056e 9ee9 05ca 4cb6 e3ce  .EG-.P.nL...
0x0020   8010 43e0 8241  0101 080a 2c8c bd83  ..C..A..,...
0x0030   0538 ae0f  .8..
13:19:42.871449 216.69.71.45.1390 > 216.69.64.149.80: R 1287054286:1287054286(0) win 0
0x0000   4500 0028 a6c8  4006 9bba d845 472d
0x0010   d845 4095 056e 0050 4cb6 e3ce
0x0020   5004  4150 P...AP..
  The attached patch sets the TCP_NOPUSH option on the socket and uses
shutdown(conn->sd, SHUT_WR) at the end of the HTTP request in order to
force the entire HTTP request to be coalesced into a minimum number of
packets.  With the attached patch applied, the same request shown above
appears on the wire as:

13:17:10.659049 216.69.71.45.2218 > 216.69.64.149.80: S 2067322044:2067322044(0) win 57344 (DF)
0x0000   4500 003c 9c27 4000 4006 6647 d845 472d
0x0010   d845 4095 08aa 0050 7b38 d4bc
0x0020   a002 e000 61f6  0204 05b4 0103 0300  a...
0x0030   0101 080a 0538 729c    .8r.
13:17:10.663461 216.69.64.149.80 > 216.69.71.45.2218: S 3505347452:3505347452(0) ack 2067322045 win 17376 (DF)
0x0000   4500 003c da68 4000 3c06 2c06 d845 4095
0x0010   d845 472d 0050 08aa d0ef 5b7c 7b38 d4bd  .EG-.P[|{8..
0x0020   a012 43e0 e8b9  0204 05b4 0103 0300  ..C.
0x0030   0101 080a 2c8c bc53 0538 729c  ,..S.8r.
13:17:10.663510 216.69.71.45.2218 > 216.69.64.149.80: . ack 1 win 57920 (DF)
0x0000   4500 0034 9c28 4000 4006 664e d845 472d
0x0010   d845 4095 08aa 0050 7b38 d4bd d0ef 5b7d
0x0020   8010 e240 761d  0101 080a 0538 729c
0x0030   2c8c bc53  ,..S
13:17:10.664197 216.69.71.45.2218 > 216.69.64.149.80: FP 1:108(107) ack 1 win 57920 (DF)
0x0000   4500 009f 9c29 4000 4006 65e2 d845 472d
0x0010   d845 4095 08aa 0050 7b38 d4bd d0ef 5b7d
0x0020   8019 e240 df70  0101 080a 0538 729c
0x0030   2c8c bc53 4745 5420 2f6e 6f6e 6578 6973  ,..SGET./nonexis
0x0040   7465 6e74 2e68 746d 6c20 4854 5450 2f31  tent.html.HTTP/1
0x0050   2e31 0d0a 486f 7374 3a20 7777 772e 6e74  .1..Host:.www.nt
0x0060   746d 636c 2e63 6f6d 0d0a 5573 6572 2d41  tmcl.com..User-A
0x0070   6765 6e74 3a20 6665 7463 6820 6c69 6266  gent:.fetch.libf
0x0080   6574 6368 2f32 2e30 0d0a 436f 6e6e 6563  etch/2.0..Connec
0x0090   7469 6f6e 3a20 636c 6f73 650d 0a0d 0a  tion:.close....
13:17:10.669275 216.69.64.149.80 > 216.69.71.45.2218: . ack 109 win 17269 (DF)
0x0000   4500 0034 8371 4000 3c06 8305 d845 4095
0x0010   d845 472d 0050 08aa d0ef 5b7d 7b38 d529  .EG-.P[}{8.)
0x0020   8010 4375 147d  0101 080a 2c8c bc53  ..Cu.}..,..S
0x0030   0538 729c  .8r.
13:17:10.670352 216.69.64.149.80 > 216.69.71.45.2218: F 514:514(0) ack 109 win 17376 (DF)
0x0000   4500 0034 ebbd 4000 3c06 1ab9 d845 4095
0x0010   d845 472d 0050 08aa d0ef 5d7e 7b38 d529  .EG-.P]~{8.)
0x0020   8011 43e0 1210  0101 080a 2c8c bc53  ..C.,..S
0x0030   0538 729c  .8r.
13:17:10.670378 216.69.71.45.2218 > 216.69.64.149.80: . ack 1 win 57920 (DF)
0x0000   4500 0034 9c2a 4000 4006 664c d845 472d
0x0010   d845 4095 08aa 0050 7b38 d529 d0ef 5b7d
0x0020   8010 e240 75b0  0101 080a 0538 729d
0x0030   2c8c bc53  ,..S
13:17:10.672885 216.69.64.149.80 > 216.69.71.45.2218: P 1:514(513) ack 109 win 17376 (DF)
[ snip file contents ]
13:17:10.672906 216.69.71.45.2218 > 216.69.64.149.80: . ack 515 win 57407 (DF)
0x0000   4500 0034 9c2b 4000 4006 664b d845 472d
0x0010   d845 4095 08aa 0050 7b38 d529 d0ef 5d7f
0x0020   8010 e03f 75af  0101 080a 0538 729d  ...?u8r.
0x0030   2c8c bc53
  Thus reducing the number of packets on the wire from 14 to 9.  Obviously
for larger transfers, the difference gets lost in the noise.  Nonetheless,
unless someone spots some undesirable side-effect that may be caused
by the change, I'll commit the attached patch in a few days.
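
The attachment itself is not reproduced in this archive.  As a hedged
sketch of the technique the message describes (hold back partial
segments, write the request, then close the write half), something like
the following C helper would do.  The function name and the REQ_HOLD_OPT
macro are hypothetical, and falling back to TCP_CORK where TCP_NOPUSH is
not defined is a portability assumption, not part of the patch.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* TCP_NOPUSH is the option named above; TCP_CORK is an assumed stand-in. */
#if defined(TCP_NOPUSH)
#define REQ_HOLD_OPT TCP_NOPUSH
#elif defined(TCP_CORK)
#define REQ_HOLD_OPT TCP_CORK
#endif

/*
 * Write the request lines with partial segments held back, then close
 * the write half so the queued data goes out together with the FIN.
 * Returns 0 on success, -1 on a write or shutdown error.
 */
int
send_request_coalesced(int sd, const char *const lines[], size_t nlines)
{
    assert(lines != NULL || nlines == 0);
#ifdef REQ_HOLD_OPT
    int on = 1;

    /* Best effort: this is only an optimization, so failure is ignored. */
    (void)setsockopt(sd, IPPROTO_TCP, REQ_HOLD_OPT, &on, sizeof(on));
#endif
    for (size_t i = 0; i < nlines; i++) {
        size_t len = strlen(lines[i]);
        size_t done = 0;

        while (done < len) {
            ssize_t n = write(sd, lines[i] + done, len - done);
            if (n == -1)
                return -1;
            done += (size_t)n;
        }
    }
    return shutdown(sd, SHUT_WR);
}
```

Each header line is still a separate write(2), as in the unpatched
trace; the difference is that the stack is asked not to emit a segment
per write.  Note that the peer sees a half-closed connection after this
call.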

Re: Patch to set TCP_NOPUSH on libfetch HTTP connections

2005-02-14 Thread Kelly Yancey

On Sat, 12 Feb 2005, Bruce M Simpson wrote:

> On Fri, Feb 11, 2005 at 01:34:21PM -0800, Kelly Yancey wrote:
> >   Thus reducing the number of packets on the wire from 14 to 9.  Obviously
> > for larger transfers, the difference gets lost in the noise.  Nonetheless,
> > unless someone spots some undesirable side-effect that may be caused
> > by the change, I'll commit the attached patch in a few days.
>
> Aren't there situations where the write-path should be kept open e.g.
> in HTTP/1.1 ?

  That fetch uses?  No.

  Kelly


[patch] Update to libfetch

2005-02-21 Thread Kelly Yancey

  Attached is a patch to address concerns raised by Pawel Worach with
regard to the recent change to set TCP_NOPUSH when sending HTTP
requests from libfetch.  The previous revision also introduced a call
to shutdown(2) to close the write half of the socket in order to force
the queued request to be sent.  While this should be perfectly
acceptable behavior for a TCP client, it appears that squid provides a
configuration option to disallow half-closed clients (which Pawel is
currently using).  As such, after introducing the shutdown(2) call,
fetch(1) can no longer fetch files via HTTP through such proxies.
  To address this issue, the attached patch replaces the call to
shutdown(2) with some socket option fiddling (clearing TCP_NOPUSH and
setting TCP_NODELAY) which does the same job of forcing the client to
write the queued request to the network without closing the write half
of the socket.  This feels a bit hackish to me, but gets the job done.
Anyway, I would appreciate any feedback.  Thanks,

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]

? fetch.3.gz
? ftperr.h
? httperr.h
? libfetch.so.3
? ~fetch-nodelay.diff
Index: http.c
===
RCS file: /home/ncvs/src/lib/libfetch/http.c,v
retrieving revision 1.75
diff -u -p -r1.75 http.c
--- http.c  16 Feb 2005 00:22:20 -  1.75
+++ http.c  21 Feb 2005 22:29:16 -
@@ -792,7 +792,7 @@ _http_request(struct url *URL, const cha
conn_t *conn;
struct url *url, *new;
int chunked, direct, need_auth, noredirect, verbose;
-   int e, i, n;
+   int e, i, n, val;
off_t offset, clength, length, size;
time_t mtime;
const char *p;
@@ -913,7 +913,20 @@ _http_request(struct url *URL, const cha
_http_cmd(conn, "Range: bytes=%lld-", (long long)url->offset);
_http_cmd(conn, "Connection: close");
_http_cmd(conn, "");
-   shutdown(conn->sd, SHUT_WR);
+
+   /*
+* Force the queued request to be dispatched.  Normally, one
+* would do this with shutdown(2) but squid proxies can be
+* configured to disallow such half-closed connections.  To
+* be compatible with such configurations, fiddle with socket
+* options to force the pending data to be written.
+*/
+   val = 0;
+   setsockopt(conn->sd, IPPROTO_TCP, TCP_NOPUSH, &val,
+  sizeof(val));
+   val = 1;
+   setsockopt(conn->sd, IPPROTO_TCP, TCP_NODELAY, &val,
+  sizeof(val));
 
/* get reply */
switch (_http_get_reply(conn)) {


Re: tcpdump/bpf and seeing .1q tags

2005-03-09 Thread Kelly Yancey
On Wed, 9 Mar 2005, Charlie Schluting wrote:

> Charlie Schluting wrote:
> > Charles Swiger wrote:
> >
> >> On Mar 9, 2005, at 2:22 PM, Charlie Schluting wrote:
> >>
> >>> More importantly, I'm trying to figure out if a bpf read will see
> >>> them as well. Any insight on this?
> >>
> >>
> >>
> >> Yes, or it will if you use promisc mode and an appropriate BPF filter:
> >>
> >
> > So promisc is enabled in my case.
> >
> > This seems to imply that the bpf will always see the vlan tags. (I don't
> > want to.. that was the point of my question)
> >
> > I believe this is starting to make sense. Thanks for your reply.
>
> Oh! Er.. I hit send too fast.
>
> So a BPF is supposed to ignore vlan tags unless 'vlan' is specified??
>

  Worse: tcpdump has no idea there is a tag on the packet, so any other
filters end up comparing against the wrong data in the packet.  For this
reason, if you are going to run tcpdump on a parent interface, you need
to either specify no filter criteria or else specify the 'vlan' keyword
so tcpdump knows what it is getting.
  You'll have a similar issue with BPF programs you write: you'll either
need to skip over the vlan tag header or not, depending on whether you
snagged the packet from the parent interface or the vlan interface.
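
For code that processes the captured frames itself, one way to cope with
both cases is to look at the EtherType and skip the tag when it is
present.  The following C helper is a minimal sketch under the
assumption of plain Ethernet II framing with at most one 802.1Q tag; the
function and constant names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    FRAME_ETHER_HDR_LEN = 14,       /* two 6-byte MACs + EtherType */
    FRAME_VLAN_TAG_LEN  = 4,        /* 802.1Q TPID + TCI */
    FRAME_TPID_8021Q    = 0x8100    /* EtherType value marking a tag */
};

/* Read a big-endian 16-bit field. */
static uint16_t
read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Return the offset of the layer-3 header and store the real EtherType
 * in *ethertype.  Returns -1 if the frame is too short.  Only a single
 * 802.1Q tag is handled; stacked tags are not.
 */
int
frame_l3_offset(const uint8_t *frame, size_t len, uint16_t *ethertype)
{
    size_t hdrlen = FRAME_ETHER_HDR_LEN;
    uint16_t type;

    assert(frame != NULL && ethertype != NULL);
    if (len < hdrlen)
        return -1;
    type = read_be16(frame + 12);
    if (type == FRAME_TPID_8021Q) {
        /* Tagged frame: the real EtherType sits 4 bytes further in. */
        hdrlen += FRAME_VLAN_TAG_LEN;
        if (len < hdrlen)
            return -1;
        type = read_be16(frame + 16);
    }
    *ethertype = type;
    return (int)hdrlen;
}
```

An untagged frame yields offset 14 and a tagged one 18, which is exactly
the 4-byte shift that makes fixed-offset filters compare against the
wrong data.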

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
"And say, finally, whether peace is best preserved by giving energy to the
 government or information to the people.  This last is the most certain and
 the most legitimate engine of government."
-- Thomas Jefferson to James Madison, 1787.


Re: bge and checksums

2005-03-24 Thread Kelly Yancey
On Thu, 24 Mar 2005, Boris Kovalenko wrote:

> Hello!
>
>   I try to use DSNiff with my FreeBSD 5.4-PRE and bge NIC. Unfortunatelly
> it does no work. My supposition is that the root of problem is bad tcp
> checksums (as shown by tcpdump). And DSNiff (and underlaying libnids)
> are checking for checksums. As I undrestand, bge has txcsum flag, so tcp
> stack does not computes checksum itself. Am I right? And may I turn off
> txcsum flag without modifying bge driver?
>

  Have you tried the -txcsum option described in ifconfig(8)?

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
FreeBSD, The Power To Serve: http://www.freebsd.org/


Re: SIOCGIFMEDIA problems

2005-05-16 Thread Kelly Yancey
On Mon, 16 May 2005, Bruce M Simpson wrote:

> On Mon, May 16, 2005 at 02:31:36PM +0200, Sebastien Petit wrote:
> > As I can see in kqueue man, I can only monitor events by file descriptor 
> > (read/write), a process id, a signal or a timer (under NetBSD 2)
> > How I can use it for monitoring link status change on a network card ?
>
> You need to use EVFILT_NETDEV and that may only be implemented on FreeBSD
> to the best of my knowledge. See kqueue(2) on FreeBSD for more details.
>

  Couldn't the same be accomplished simply by reading a routing socket?
Of course, one could use kqueue(2), libevent, or whatever to get
event-driven notification of routing socket updates.  That is exactly
what I have been doing at work since before EVFILT_NETDEV was added.  As
far as I can tell, the only advantage EVFILT_NETDEV has is that you don't
have to weed through routing messages to get the interface messages.  But
using a routing socket has the advantage of being more portable.

  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]
FreeBSD, The Power To Serve: http://www.freebsd.org/


Please review: patch to add FAST_IPSEC stats to netstat

2005-12-21 Thread Kelly Yancey


  Here is a patch to display stats gathered by the FAST_IPSEC stack:

http://people.freebsd.org/~kbyanc/netstat-fastipsec.diff

  If you have built your kernel with FAST_IPSEC, then without this
patch "netstat -s -p ipsec" displays nothing.  With this patch, it will
display the generic ipsec stats gathered by FAST_IPSEC (which are
different than the stats collected by the KAME stack).  In addition,
stats for the "esp", "ah", and "ipcomp" protocols are also supported.
  Originally, this functionality was added in-house by Matt Titus to
FreeBSD 4.10 and I have ported it to -current as of today.  I've tried
to verify I didn't introduce any regressions to the -current version of
netstat in the merge process, but I would appreciate any review and/or
feedback I can get.  Barring any objections, I plan on committing this in
1 week (on the 28th).  Thank you,


  Kelly

--
Kelly Yancey  -  [EMAIL PROTECTED],FreeBSD.org}  -  [EMAIL PROTECTED]