Re: question about fopen fd limit

2016-12-24 Thread Alfred Perlstein

Hello 盛慧华

Here's another trick that may work.

Use funopen(3) and provide your own read/write/seek and close functions 
for the high fds.


You can basically make "cookie" a struct that contains your "int sized" fds.

 FILE *
 funopen(const void *cookie, int (*readfn)(void *, char *, int),
     int (*writefn)(void *, const char *, int),
     fpos_t (*seekfn)(void *, fpos_t, int), int (*closefn)(void *));


If you need more help please make sure to email me directly so I can see 
your question.


-Alfred


On 12/23/16 12:48 AM, 盛慧华 wrote:

hi all,

Thank you for your advice.
Solution 2 definitely broadened my horizons, but it may not be a good choice for 
my project. LoL
I will try mailing the freebsd-current mailing list. If libc is as you describe, 
maybe I should modify it myself.
Thank you so much.
Are you KingSoft's Dr. Zhang? Nice to meet you!


 winson  sheng
  



  
From: Hongjiang Zhang

Date: 2016-12-23 11:44
To: 盛慧华; freebsd-net
Subject: RE: RE: question about fopen fd limit
Ok. I know.
There are two possible solutions:
1. Quick solution for the short term: modify short to int in libc yourself, 
then buildworld and installworld. Pushing a libc change upstream may take a long time, 
especially since only a few people encounter this issue. You'd better send email to 
freebsd-current to confirm whether they will accept your suggestion.
2. Workaround: you can first reserve a series of fds before opening TCP 
connections. For example, invoke open("/dev/null") for 1 times to get 1 
fds. Those fd values are small enough to be held by a "short". After that, start the 
TCP connections. Once you need to fopen a file, call open("xxx") 
instead, and then use dup2(old_fd, new_fd) to duplicate old_fd onto new_fd. The old_fd 
value is the one obtained by open("xxx"), and new_fd is one from your reserved fd 
range; next, use fdopen(fd, mode). Here, you have to manage the 
reserved fds yourself, including open/close.
  
In my eyes:

Solution 1 is the quick method, and it needs no modifications to your logic.
Solution 2 needs you to maintain the reserved consecutive range of fds yourself, which 
increases the complexity of your logic.
  
Thanks

Hongjiang Zhang
  
From: 盛慧华 [mailto:hhsh...@corp.netease.com]

Sent: Friday, December 23, 2016 11:02 AM
To: Hongjiang Zhang ; freebsd-net 

Subject: Re: RE: question about fopen fd limit
  
hi all,
  
   I don't mean mapping TCP to FILE; you misunderstood my meaning.
  
   For example, if my server's TCP connections already hold 32000 fds,

   fopen has only 767 fds left to use.
  
   The problem has nothing to do with the TCP fds, but with fopen ...
  
   In some particular situations my server will open 1k+ FILEs; that will exceed the fileno limit, and an overflow occurs.

   My server can't open any more files. That's the problem.
  
   So I feel that if FreeBSD could officially change the FILE struct's fileno to an UNSIGNED SHORT, that might be an efficient and convenient solution, at least for my case?

   UNSIGNED SHORT fileno is enough for me, and I don't want to change a lot of 
FILE functions that take FILE * as their argument.
   
   Thank you
  
 winson sheng
   
  



  
From: Hongjiang Zhang

Date: 2016-12-23 10:17
To: 盛慧华; freebsd-net
Subject: RE: question about fopen fd limit
Why do you need to map TCP fd to FILE?
  
It is difficult to modify FILE structure. If it is possible, let us figure out some new designs to meet your requirement.
  
-Original Message-

From: owner-freebsd-...@freebsd.org [mailto:owner-freebsd-...@freebsd.org] On 
Behalf Of ???
Sent: Thursday, December 22, 2016 11:57 PM
To: freebsd-net 
Subject: question about fopen fd limit
  
hi all,

We are from NetEase, a Chinese game development company,
and one of our products uses FreeBSD as its OS platform.
This game has millions of players online, and each server may hold 25000+ 
TCP connections at the same time. Thanks to BSD and kqueue :)
  
For example, on one of our servers, the netstat command listing all connections:

netstat -an | grep 13396 (it's our listening port) | wc -l
23221
  
 Recently we did some performance optimization and raised this connection limit to 28000+ or 3+.

   But we found that FreeBSD has a limit: this huge number of online users takes 
28000+ fds, and the BSD FILE struct's
   fd field is only a SHORT, as in:
  
struct __sFILE {

...
short _file; /* (*) fileno, if Unix descriptor, else -1 */  ...
  
   So if our server wants to fopen a file while we still hold this many connections, the fd number may easily exceed 32767, and fopen will definitely return an error. Then the server hits fatal errors.
  
   We ran a simple test and confirmed this situation.
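
The arithmetic behind the limit can be checked with a small sketch. It assumes a 16-bit short (SHRT_MAX of 32767), and the helper names fd_fits_in_short() and fopen_headroom() are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* True if fd can be stored in a short _file field without overflow. */
bool
fd_fits_in_short(int fd)
{
	return (fd >= 0 && fd <= SHRT_MAX);
}

/* Descriptor numbers still usable by fopen() once in_use low fds are taken. */
int
fopen_headroom(int in_use)
{
	return (in_use >= SHRT_MAX ? 0 : SHRT_MAX - in_use);
}
```

With 32000 descriptors held by connections, fopen_headroom(32000) gives the 767 mentioned elsewhere in the thread.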
  
   Then, in fopen's code, we noticed that we can use open to obtain an fd instead of fopen to avoid this overflow,

as below
  
/*
 * File descriptors are a full int, but _file is only a short.
 * If we

Re: Does FreeBSD have sendmmsg or recvmmsg system calls?

2016-01-27 Thread Alfred Perlstein



On 1/26/16 4:39 PM, Luigi Rizzo wrote:

On Tue, Jan 26, 2016 at 4:31 PM, Gary Jennejohn  wrote:

On Tue, 26 Jan 2016 17:46:52 -0500 (EST)
Daniel Eischen  wrote:


On Tue, 26 Jan 2016, Gary Jennejohn wrote:


On Tue, 26 Jan 2016 09:06:39 -0800
Luigi Rizzo  wrote:


On Tue, Jan 26, 2016 at 5:40 AM, Konstantin Belousov
 wrote:

On Mon, Jan 25, 2016 at 11:22:13AM +0200, Boris Astardzhiev wrote:

+ssize_t
+recvmmsg(int s, struct mmsghdr *__restrict msgvec, size_t vlen, int flags,
+const struct timespec *__restrict timeout)
+{
+ size_t i, rcvd;
+ ssize_t ret;
+
+ if (timeout != NULL) {
+ fd_set fds;
+ int res;

Please move all local definitions to the beginning of the function.

This style recommendation was from 30 years ago and is
bad programming practice, as it tends to complicate analysis
for the human and increase the chance of improper usage of
variables.

We should move away from this for new code.


Really?  I personally find having all variables grouped together
much easier to understand.  Stumbling across declarations in the
middle of the code in a for-loop, for example, takes me by surprise.

I also greatly dislike initializing variables in their declarations.

Maybe I'm just old fashioned since I have been writing C-code for
more than 30 years.

+1

Probably should be discouraged, but allowed on a case-by-case
basis.  One could argue that if you need declaration blocks
in the middle of code, then that code is too complex and should
be broken out into a separate function.


Right.

And code like this

int func(void)
{
   int baz, zot;
   [some more code]
   if (zot < 5)
   {
      int baz = 3;
      [more code]
   }
   [some more code]
}

is even worse.  The compiler (clang) seems to consider this to
merely be a reinitialization of baz, but a human might be confused.

oh please... :)

This is simply an inner variable shadowing the outer one
(which is another poor practice, flagged with -Wshadow ).
When you exit the scope you get the external variable
with its value, as you can see from the following code.

   #include <stdio.h>
   int main(int ac, char *av[])
   {
      int baz = 5;
      printf("1 baz %d\n", baz);
      {
         int baz = 3;
         printf("2 baz %d\n", baz);
      }
      printf("3 baz %d\n", baz);
      return 0;
   }

I agree wholeheartedly with Luigi.   I am also surprised that shadowed 
variable warnings were not more widely understood.


It's time to move forward and make the code more readable and 
maintainable.  Having scoped variables just makes sense.  It's true that 
if you see very many of them, then it's likely time to introduce 
separate functions, but only in extreme cases, not on a case-by-case basis.


-Alfred
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kern.ipc.sockbuf limits: anyone mind if I commit this?

2015-11-10 Thread Alfred Perlstein



On 11/10/15 3:13 PM, Adrian Chadd wrote:

hiya,

there's a PR with a patch:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204438

https://github.com/sparrc/freebsd/commit/157f90c55d1d54d33f41c6f7517de1a9c5f5e229

Does anyone know why setting the limits isn't as simple as this patch?

Does anyone mind if I just commit this?

I don't mind too heavily. The old behavior is bad and confusing, but at 
least it stops you; the new behavior will be odd and incorrect without 
warning.


More succinctly: Silently "accepting" but actually changing the value 
passed in seems wrong.


It would seem the reason for the calculation is to actually limit the 
number of bytes of mbufs (not just data) to the max value?  Is that true?


Maybe it makes sense to export sb_max_adj via sysctl and allow setting 
of it instead?  Having silent clipping seems worse than an error.
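
To make the two behaviors concrete, here is a small user-space model. The formula is my recollection of how uipc_sockbuf.c derives sb_max_adj and should be treated as an assumption, as should the sample values MSIZE = 256 and MCLBYTES = 2048. The function names are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed adjustment: scale the byte limit by the share of each cluster
 * (plus its mbuf header) that actually carries data.
 */
uint64_t
sb_max_adjusted(uint64_t sb_max, uint64_t msize, uint64_t mclbytes)
{
	return (sb_max * mclbytes / (msize + mclbytes));
}

/* Strict behavior: a request above the adjusted limit is refused. */
int
reserve_strict(uint64_t want, uint64_t sb_max_adj)
{
	return (want <= sb_max_adj);
}

/* Clamping behavior: the request "succeeds" with a smaller value. */
uint64_t
reserve_clamped(uint64_t want, uint64_t sb_max_adj)
{
	return (want > sb_max_adj ? sb_max_adj : want);
}
```

With a 2 MB sb_max the adjusted limit is 1864135 bytes under these assumptions, so a 2000000-byte request is either rejected outright or silently shrunk, which is the difference being argued about.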


-Alfred


Re: Idle connections via accept_filter(9)

2015-04-27 Thread Alfred Perlstein
This is over 15 years old. I currently don't know of a great solution to this 
problem. Might make sense to create a timer that runs and refs the socket that 
will occasionally fire and cleanse out the old connections. 

Shouldn't be that hard to do. 
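
A rough user-space model of the reaping pass such a timer might run. This is not kernel code: struct pending_conn and reap_idle() are hypothetical stand-ins for entries on the incomplete queue and for the timer callback.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for one entry on the incomplete queue. */
struct pending_conn {
	int	id;
	time_t	queued_at;
};

/*
 * Drop entries that have waited longer than max_idle seconds, compacting
 * the array in place. Returns the number of entries that remain.
 */
size_t
reap_idle(struct pending_conn *q, size_t n, time_t now, time_t max_idle)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		if (now - q[i].queued_at <= max_idle)
			q[kept++] = q[i];
		/* else: a real implementation would abort the connection */
	}
	return (kept);
}
```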

Sent from my iPhone

> On Apr 27, 2015, at 9:19 AM, hiren panchasara  
> wrote:
> 
>> On 04/27/15 at 09:10P, Adrian Chadd wrote:
>> ask alfred? :)
> 
> Thanks! CCing him.
>> 
>> 
>> -a
>> 
>> 
>>> On 27 April 2015 at 02:22, hiren panchasara  
>>> wrote:
>>> Wanted to see if someone with understanding of accept_filter can
>>> comment.
>>> 
>>> cheers,
>>> Hiren
>>>> On 04/09/15 at 09:08P, hiren panchasara wrote:
>>>> If a connection comes in on a socket with accf_data(9) (for example) but
>>>> never sends any data, it'll occupy resources by staying forever in the
>>>> listen queue of partial unaccepted connections (socket->so_incomp), which
>>>> can be seen as incqlen in 'netstat -Lan'.
>>>> The kernel will never pass this connection down to the application, as
>>>> the filter criteria haven't been met (no data), and the application
>>>> would never know about this connection.
>>>>
>>>> What I am not sure about is the state of the connection
>>>> and the state of the socket in this situation. We do get here after
>>>> finishing the 3WHS but before handing this over to the application, i.e.
>>>> before the accept().
>>>>
>>>> From uipc_socket.c:
>>>>
>>>> * From the passive side, a socket is created with two queues of sockets:
>>>> * so_incomp for connections in progress and so_comp for connections already
>>>> * made and awaiting user acceptance.  As a protocol is preparing incoming
>>>> * connections, it creates a socket structure queued on so_incomp by calling
>>>> * sonewconn().  When the connection is established, soisconnected() is
>>>> * called, and transfers the socket structure to so_comp, making it available
>>>> * to accept().
>>>>
>>>> So, it looks like the connection would be in ESTABLISHED state but the
>>>> socket would be stuck in the so_incomp queue. Other than this special
>>>> condition of accept_filter, can such a situation occur?
>>>>
>>>> Any insight/help into understanding this scenario and a way to clean up
>>>> these connections would be great.
>>>>
>>>> (I know tcp doesn't care/worry about idle sitting connections; we have
>>>> keepalives to check the health of the connection but that's it, afaik)
>>>>
>>>> Cheers,
>>>> Hiren


Re: Implementing backpressure in the NFS server

2015-02-25 Thread Alfred Perlstein


On 2/25/15 5:08 PM, Garrett Wollman wrote:

Here's the scenario:

1) A small number of (Linux) clients run a large number of processes
(compute jobs) that read large files sequentially out of an NFS
filesystem.  Each process is reading from a different file.

2) The clients are behind a network bottleneck.

3) The Linux NFS client will issue NFS3PROC_READ RPCs (potentially
including read-ahead) independently for each process.

4) The network bottleneck does not serve to limit the rate at which
read RPCs can be issued, because the requests are small (it's only the
responses that are large).

5) Even if the responses are delayed, causing one process to block,
there are sufficient other processes that are still runnable to allow
more reads to be issued.

6) On the server side, because these are requests for different file
handles, they will get steered to different NFS service threads by the
generic RPC queueing code.

7) Each service thread will process the read to completion, and then
block when the reply is transmitted because the socket buffer is full.

8) As more reads continue to be issued by the clients, more and more
service threads are stuck waiting for the socket buffer until all of
the nfsd threads are blocked.

9) The server is now almost completely idle.  Incoming requests can
only be serviced when one of the nfsd threads finally manages to put
its pending reply on the socket send queue, at which point it can
return to the RPC code and pick up one request -- which, because the
incoming queues are full of pending reads from the problem clients, is
likely to get stuck in the same place.  Lather, rinse, repeat.

What should happen here?  As an administrator, I can certainly
increase the number of NFS service threads until there are sufficient
threads available to handle all of the offered load -- but the load
varies widely over time, and it's likely that I would run into other
resource constraints if I did this without limit.  (Is 1000 threads
practical? What happens when a different mix of RPCs comes in -- will
it livelock the server?)

I'm of the opinion that we need at least one of the following things
to mitigate this issue, but I don't have a good knowledge of the RPC
code to have an idea how feasible this is:

a) Admission control.  RPCs should not be removed from the receive
queue if the transmit queue is over some high-water mark.  This will
ensure that a problem client behind a network bottleneck like this one
will eventually feel backpressure via TCP window contraction if
nothing else.  This will also make it more likely that other clients
will still get their RPCs processed even if most service threads are
taken up by the problem clients.

b) Fairness scheduling.  There should be some parameter, configurable
by the administrator, that restricts the number of nfsd threads any
one client can occupy, independent of how many requests it has
pending.  A really advanced scheduler would allow bursting over the
limit for some small number of requests.

Does anyone else have thoughts, or even implementation ideas, on this?
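
Options (a) and (b) boil down to two checks made before a service thread picks up a request. The sketch below only illustrates those checks; struct client_state, admit_rpc() and may_take_thread() are hypothetical and do not correspond to the real RPC code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-client bookkeeping; not the real RPC structures. */
struct client_state {
	size_t	txq_bytes;	/* reply bytes waiting in the send queue */
	int	busy_threads;	/* service threads working for this client */
};

/* (a) Admission control: leave the request queued while replies back up. */
bool
admit_rpc(const struct client_state *c, size_t txq_hiwat)
{
	return (c->txq_bytes < txq_hiwat);
}

/* (b) Fairness: cap the threads one client may occupy, plus a small burst. */
bool
may_take_thread(const struct client_state *c, int max_threads, int burst)
{
	return (c->busy_threads < max_threads + burst);
}
```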

The default number of threads is insanely low. The only reason I didn't 
bump them to FreeNAS levels (or higher) was the inevitable 
bikeshed/cryfest about Alfred touching defaults, so I didn't bother.  I 
kept them really small because, y'know, people whine, and they are capped 
at ncpu * 8; it really should be higher imo.


Just increase the nfs servers to something higher, I think we were at 
256 threads in FreeNAS and it did us just fine.  Higher seemed ok, 
except we lost a bit of performance.


The only problem you might see is on SMALL machines where people will 
complain.  So probably want an arch specific override or perhaps a 
memory based sliding scale.


If that could become a FreeBSD default (with overrides for small memory 
machines and arches) that would be even better.


I think your other suggestions are fine, however the problem is that:
1) they seem complex for an edge case
2) turning them on may tank performance for no good reason if the 
heuristic is met but we're not in the bad situation


That said if you want to pursue those options, by all means please do.

-Alfred


Re: Adding new media types to if_media.h

2015-02-25 Thread Alfred Perlstein



On 2/25/15 5:11 PM, Gleb Smirnoff wrote:

On Mon, Feb 16, 2015 at 07:50:56PM -0600, Mike Karels wrote:
M> Well, I developed the prototype as I had planned, using a 64-bit media
M> word, and found that I got about 100 files in GENERIC that didn't compile;
M> they attempted to store "media words" in an int.  My kingdom for a typedef.
M> That didn't meet my goal of KPI compatibility, so I went to Plan B.
M>
M> Plan B is to steal an unused bit (RFU) to indicate an "extended" media
M> type.  I then used the variant/subtype field to store the extended type.
M> Effectively, the previously unused bit doubles the effective size of the
M> subtype  field.  Given that the previous 5-bit field lasted us 18 years,
M> I figured that doubling it would last a while.  I also changed the
M> SIOCGIFMEDIA ioctl, splitting it for binary compatibility; extended
M> types are all mapped to IFM_OTHER (31) using the old interface, but
M> are visible using the new one.
M>
M> With these changes, I modified one driver (vtnet) to use an extended type,
M> and the rest of GENERIC is happy.  The changes to ifconfig are also fairly
M> small.  The patch is appended, where email programs will screw it up,
M> or at ftp://ftp.karels.net/outgoing/if_media.patch.
M>
M> The VFAST subtype is a throw-away for testing.
M>
M> This seems like a reasonably pragmatic change to support the new 40 Gb/s
M> media types until someone wants to design an improved but non-backward-
M> compatible interface.  I think it meets the goal of suitability for
M> back-porting; it could be MFCed.
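
The "Plan B" encoding can be pictured with a short sketch. The bit position and all XM_* names below are hypothetical and chosen only for illustration; they are not the values used in the patch or in if_media.h.

```c
#include <stdint.h>

#define XM_TMASK	0x0000001fu	/* 5-bit subtype field */
#define XM_EXT		0x00000800u	/* hypothetical spare (RFU) bit */
#define XM_OTHER	31u		/* catch-all subtype for old callers */

/* Pack a subtype in 0..63; values of 32 and up set the extended bit. */
uint32_t
xm_encode(unsigned subtype)
{
	uint32_t w = subtype & XM_TMASK;

	if (subtype > XM_TMASK)
		w |= XM_EXT;
	return (w);
}

/* New interface: the extended bit acts as a sixth subtype bit. */
unsigned
xm_decode(uint32_t w)
{
	return ((w & XM_TMASK) + ((w & XM_EXT) ? 32u : 0u));
}

/* Old interface: every extended type collapses to the catch-all value. */
unsigned
xm_legacy(uint32_t w)
{
	return ((w & XM_EXT) ? XM_OTHER : (w & XM_TMASK));
}
```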

I will dare to vote against the crowd.

We can't and don't plan to preserve the driver KPI for the 11 branch. The
plan, which I hope to accomplish by 11, is to provide a driver KPI where
drivers do not know about struct ifnet and other network stack stuff. Of
course, that's a huge change in KPI. But we do it for the sake of avoiding
future changes.

So, all these tricks with one extra bit seem unnecessary to me. I'd suggest
introducing a new 'struct ifmedia' with enough space, and of course putting extra
space in there. Give a new value to SIOCGIFMEDIA. Write new, clean code
to handle it, without any extended bit tricks.

For the sake of the userland API, save the current 'struct ifmedia' as
'struct oifmedia', and move the old value of the ioctl to OSIOCIGIFMEDIA.
Write a function under BURN_BRIDGES that handles OSIOCIGIFMEDIA and
tries to convert from ifmedia to oifmedia.

To summarise: the patch adds tricks to just double the ifmedia name space,
not solving the problem forever. A new API is introduced, but the old limited one
has no foreseeable plan for obsolescence, since the new one is tied to it. All the tricks
are performed for the sake of driver KPI stability, which isn't planned
to be kept for this major release cycle.


+1, rip the bandaid off.

-Alfred


Re: [Differential] [Commented On] D1764: Factor out ip6_deletefraghdr()

2015-02-15 Thread Alfred Perlstein
Can you use the commit log string and try that? 

Sent from my iPhone

> On Feb 15, 2015, at 5:32 PM, glebius (Gleb Smirnoff) 
>  wrote:
> 
> glebius added a comment.
> 
> Damn f*ckbrikator doesn't allow me to close the revision, since I don't own 
> it.
> 
> Kristof, looks like you will need to manually close all your revisions as I 
> commit them. Or we can just leave some trash in this "pretty" software.
> 
> REVISION DETAIL
>  https://reviews.freebsd.org/D1764
> 
> To: kristof, ae, glebius
> Cc: ae, glebius, freebsd-net


Re: Fwd: Adding new media types to if_media.h

2015-02-08 Thread Alfred Perlstein


On 2/8/15 2:41 PM, Mike Karels wrote:


To solve the second problem, I think the right approach would be to reduce
this interface to a truly generic one, such as media type (e.g. Ethernet),
generic flags, and perhaps generic status.  Then there should be a separate
media-specific interface for each type, such as Ethernet and 802.11.  To a
small extent, we already have that.  Solving the second, more general problem,
requires a whole new driver KPI that will require surgery to every driver,
which is not an exercise that I would consider.


I am willing to do a prototype for -current for evaluation.

Comments, alternatives, ?

Mike,

I think we have enough people to chip in that your concern about 
breaking the KPI is not as bad as you think.


Would like to hear the correct + long term + less hackish proposal 
first.


Norse has a kernel team that is heavily invested in networking that can 
help with the transition.


If done right, likely by renaming ALL of the macros, it will be quite 
trivial to catch all bad cases and move us forward in one great leap.


-Alfred



Nasty bug in startup scripts with interface renaming.

2015-02-07 Thread Alfred Perlstein
If you happen to use interface renaming there is a nasty bug lurking in the 
startup scripts. It seems newly introduced, but I am unsure.

Specifically the following happens at boot time:

/etc/rc.d/netif is run without args.

It gets the list of interfaces and for each interface it calls network_start().

However, in network_start() we have this:

# Create cloned interfaces
clone_up $cmdifn

# Rename interfaces.
ifnet_rename $cmdifn

# Configure the interface(s).
network_common ifn_start $cmdifn

Now it doesn't take that much to realize that if 'ifnet_rename' renames 
'cmdifn' then the subsequent call to 'network_common ifn_start $cmdifn' will be 
passing a stale interface name in as a parameter and cause a bunch of errors to 
happen.

Example:
cmdifn="vtnet0"

Therefore:

# Rename interfaces.
ifnet_rename vtnet0 # <- gets renamed here to derp0

# Configure the interface(s).
network_common ifn_start vtnet0  # <- this seems to cause an error since we're using the old name.


I looked at fixing ifnet_rename() to take a variable to assign to, so for 
instance the call could turn into something like:

ifnet_rename cmdifn  vtnet0  

This way cmdifn would be set to 'derp0' and subsequent stuff would work. 
However, I then realized that ifnet_rename can take 0 args, or MULTIPLE args, 
and will act on either all interfaces or the ones passed in.  So passing 
another var becomes a problem.

I then realized that if I threw together a patch to fix it "the alfred way" 
people would probably be upset.

So I'm asking, any suggestions before I go about just fixing this?

-Alfred





Re: RFC: Enabling VIMAGE in GENERIC

2014-11-17 Thread Alfred Perlstein


On 11/17/14, 3:02 AM, Warner Losh wrote:

On Nov 17, 2014, at 12:46 AM, Craig Rodrigues  wrote:


Hi,

PROPOSAL
==
I would like to get feedback on the following proposal.
In the head branch (CURRENT), I would like to enable
VIMAGE with this commit:


PATCH
==

Index: sys/conf/NOTES
===
--- sys/conf/NOTES  (revision 274300)
+++ sys/conf/NOTES  (working copy)
@@ -784,8 +784,8 @@
device mn  # Munich32x/Falc54 Nx64kbit/sec cards.

# Network stack virtualization.
-#options   VIMAGE
-#options   VNET_DEBUG  # debug for VIMAGE
+options    VIMAGE
+options    VNET_DEBUG  # debug for VIMAGE

#
# Network interfaces:



I would like to enable VIMAGE for the following reasons:

REASONS


(1)  VIMAGE cannot be enabled off to the side in a separate library or
   kernel module.  When enabled, it is a kernel ABI incompatible change.
   This has impact on 3rd party code such as the kernel modules
   which come with VirtualBox.
   So the time to do it in CURRENT is now, otherwise we can't consider
   doing it until FreeBSD-12 timeframe, which is quite a while away.

(2)  VIMAGE is used in some  3rd party products, such as FreeNAS.
   These 3rd party products are mostly happy with VIMAGE,
   but sometimes they encounter problems, and FreeBSD doesn't
   see these problems because it is disabled by default.

(3)  Most of the major subsystems like ipfw and pf have been fixed for
   VIMAGE, and the only way to shake out the last few issues is to make
   it the default and get feedback from the community.  ipfilter still
   needs to be VIMAGE-ified.


(4)  Not everyone uses bhyve.  FreeBSD jails are an excellent virtualization
   platform for FreeBSD.  Jails are still very popular and
   performant.  VIMAGE makes jails even better by allowing per-jail
   network stacks.

(5)  Olivier Cochard-Labbe has provided good network performance results
   in VIMAGE vs. non-VIMAGE kernels:


https://lists.freebsd.org/pipermail/freebsd-net/2014-October/040091.html

(6)  Certain people like Vitaly "wishmaster"  have been running VIMAGE
   jails in a production environment for quite a while, and would like
   to see it be the default.


ACTION PLAN
===

(1)  Coordinate/communicate with portmgr, since this has kernel ABI
implications

(2)  Work with clusteradm@, and try to get a test instance of one of the
   PF firewalls in the cluster working with a VIMAGE enabled kernel.

(3)   Take a pass through http://wiki.freebsd.org/VIMAGE/TODO and
https://bugs.freebsd.org/bugzilla/buglist.cgi?quicksearch=vimage%20or%20vnet
   and try to clean things up.  Get help from net@ developers to do this.

And if these don’t get cleaned up?

If they are not cleaned up and stable by 11-RELEASE, then we turn it off.  
That is simple.





(4)   Take a pass on trying to VIMAGE-ify ipfilter.  I'll need help from
the ipfilter maintainers for this and some net@ developers.

And if this doesn’t happen?


Well we do have 2 other firewalls in the kernel to pick, but we do need 
VIMAGE so I will let you draw your own conclusions.


-Alfred



Re: performance of the switch/case statements

2014-10-30 Thread Alfred Perlstein
Please run compiler with -O2 -S to get the assembly to see what will 
actually happen.
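
For instance, a small file like the sketch below can be compiled with cc -O2 -S to compare what the compiler emits for dense and sparse case values. The function names and the file name switch_demo.c are hypothetical, and the comments describe what compilers commonly do, not a guarantee for any particular compiler or version.

```c
/* Dense case values 1..4: a candidate for a jump table or lookup table. */
int
dense(int a)
{
	switch (a) {
	case 1:
		return (10);
	case 2:
		return (20);
	case 3:
		return (30);
	case 4:
		return (40);
	default:
		return (-1);
	}
}

/* Sparse values: often compiled to a series of compares and branches. */
int
sparse(int a)
{
	switch (a) {
	case 3:
		return (10);
	case 700:
		return (20);
	case 40000:
		return (30);
	case 9000000:
		return (40);
	default:
		return (-1);
	}
}
```

Running cc -O2 -S switch_demo.c leaves the assembly in switch_demo.s, where the two functions can be compared side by side.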


thanks,
-Alfred

On 10/29/14 9:24 PM, bycn82 wrote:

Hi,
According to my understanding from Java programming, the compiler will
automatically store the targets in a table and jump directly to the correct one
only when the case values are consecutive numbers.

For example:

switch(a){
case 1:  code block 1
case 2:  code block 2
case 3:  code block 3
case 4:  code block 4
default: code block 5
}

will be handled by an array:
1-->code block 1
2-->code block 2
3-->code block 3
4-->code block 4
others-->code block 5

So when the value N is greater than 4 or less than 1, it will jump directly
to "code block 5";
otherwise it will jump to entry N, because all the cases are nicely
consecutive.

But when the case values are messy, it will be just like lots of if/else.

On Thu, Oct 30, 2014 at 6:30 AM, Erich Dollansky <
erichsfreebsdl...@alogt.com> wrote:


Hi,

On Wed, 29 Oct 2014 22:39:34 +0800
"bycn82"  wrote:


It is using the switch/case statement to make the code clear in the

I am not a C programmer, so I am not clear on how the switch/case will be
optimized by the compiler in FreeBSD. But I once wrote a compiler
myself, and I used a hash table to handle all the conditions in the
case statements because my compiler doesn't care about performance!
But in C it is different: the case statement can only accept "int"
values, so I don't think it will use a hash or the like; it should
directly use an array. So whether it can be optimized depends on
the conditions in the switch/case statements, and I noticed that the
case statements in the 2 loops do not arrange the opcodes in
consecutive order, so is the compiler smart enough to optimize it?



I did not check recently. It was already a long, long time ago, that
compilers checked the limits and used the values as an index into a
table to jump to the code. I hope that this did not get changed.

In other words, the order in the code does not matter. The only
optimisation the compiler can do, is not to use a table if the
statement consists of a low number of entries only.

Erich




Re: Multipath TCP for FreeBSD v0.4

2014-09-17 Thread Alfred Perlstein
Github offers an excellent system with comments and all that jazz for 
making pull requests.


Super simple to use.

On 9/17/14 3:34 PM, Eric Joyner wrote:

As a random person without commit privileges, I hope so, too.

---
- Eric Joyner

On Wed, Sep 17, 2014 at 8:44 AM, Sean Bruno  wrote:


On Wed, 2014-09-17 at 12:58 +1000, Nigel Williams wrote:

On 17/09/14 08:48, Sean Bruno wrote:

On Mon, 2014-09-08 at 11:32 +1000, Nigel Williams wrote:

Hi,

We recently released a new tech report "Design Overview of Multipath TCP
version 0.4 for FreeBSD-11" [1]. The report provides some details on
various aspects of the implementation (session management, data-level
retransmission etc), as of the most recent v0.4 patch [2].

cheers,
nigel

[1] http://caia.swin.edu.au/reports/140822A/CAIA-TR-140822A.pdf
[2] http://caia.swin.edu.au/urp/newtcp/mptcp/tools.html



Nigel:

Hi!  Are you folks interested in having this patchset incorporated into
the main line of FreeBSD?  I'm open to putting up a phabricator review
for you folks at https://reviews.freebsd.org if that's something you
guys want to do?

sean


Hi Sean,

Thanks, but I think it's too early to put it into phabricator. The patch
releases thus far are early test previews for those who are interested
and perhaps willing to play around with. So in short, it's not
production quality and not ready for committing to mainline.

I'll continue to announce these patches on the mailing list for the time
being. I'm of course open to feedback/suggestions/questions and will
provide documentation with each release.

cheers,
nigel



Noted.  Thank you for the feedback.

I hope, that someday, https://reviews.freebsd.org becomes more of a code
review tool for users than it is being used for today.

sean



Re: mbuf autotuning changes

2013-09-06 Thread Alfred Perlstein

On 9/6/13 12:10 PM, hiren panchasara wrote:

tunable_mbinit() in kern_mbuf.c looks like this:

119     /*
120      * The default limit for all mbuf related memory is 1/2 of all
121      * available kernel memory (physical or kmem).
122      * At most it can be 3/4 of available kernel memory.
123      */
124     realmem = qmin((quad_t)physmem * PAGE_SIZE,
125         vm_map_max(kmem_map) - vm_map_min(kmem_map));
126     maxmbufmem = realmem / 2;
127     TUNABLE_QUAD_FETCH("kern.ipc.maxmbufmem", &maxmbufmem);
128     if (maxmbufmem > realmem / 4 * 3)
129             maxmbufmem = realmem / 4 * 3;

If I am reading the code correctly, we lose the value set on line 126 when we
do the FETCH on line 127.

And after line 127, if we haven't specified kern.ipc.maxmbufmem (in
loader.conf - I guess...), we set that value to 0.

And because of that, the if condition on line 128 is almost always false?

What am I missing here?

Thanks,
Hiren

I think TUNABLE_*_FETCH will only write to the variable if the tunable is
explicitly set.

Meaning, unless the user actually sets a value in loader.conf, line 127
is a no-op.
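
A rough userland model of that behaviour, for illustration only. The name fetch_quad_tunable() is hypothetical and reads the process environment rather than the kernel environment; it is not the real TUNABLE_QUAD_FETCH macro. It only shows the "write when present, leave alone otherwise" contract described above.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a tunable fetch: look the name up in the
 * process environment and overwrite *dst only when a value is present
 * and parses as a number.  Returns 1 if *dst was written, 0 otherwise.
 */
int
fetch_quad_tunable(const char *name, long long *dst)
{
	const char *val;
	char *end;
	long long parsed;

	assert(name != NULL && dst != NULL);
	val = getenv(name);
	if (val == NULL || *val == '\0')
		return (0);		/* unset: *dst keeps its default */
	parsed = strtoll(val, &end, 0);
	if (*end != '\0')
		return (0);		/* not a number: also leave *dst alone */
	*dst = parsed;
	return (1);
}
```

With this model, a default assigned before the call survives whenever the name is not set, which is the situation on line 126/127 of the quoted code.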


-Alfred

--
Alfred Perlstein



Re: mbuf autotuning changes

2013-09-06 Thread Alfred Perlstein

On 9/6/13 12:36 PM, hiren panchasara wrote:

On Fri, Sep 6, 2013 at 12:14 PM, Alfred Perlstein  wrote:


On 9/6/13 12:10 PM, hiren panchasara wrote:


tunable_mbinit() in kern_mbuf.c looks like this:

119     /*
120      * The default limit for all mbuf related memory is 1/2 of all
121      * available kernel memory (physical or kmem).
122      * At most it can be 3/4 of available kernel memory.
123      */
124     realmem = qmin((quad_t)physmem * PAGE_SIZE,
125         vm_map_max(kmem_map) - vm_map_min(kmem_map));
126     maxmbufmem = realmem / 2;
127     TUNABLE_QUAD_FETCH("kern.ipc.maxmbufmem", &maxmbufmem);
128     if (maxmbufmem > realmem / 4 * 3)
129             maxmbufmem = realmem / 4 * 3;

If I am reading the code correctly, we lose the value set on line 126 when we
do the FETCH on line 127.

And after line 127, if we haven't specified kern.ipc.maxmbufmem (in
loader.conf - I guess...), we set that value to 0.

And because of that, the if condition on line 128 is almost always false?

What am I missing here?

Thanks,
Hiren

I think TUNABLE_*_FETCH will only write to the variable if the tunable is
explicitly set.

Meaning, unless the user actually sets a value in loader.conf, line 127 is
a no-op.


Thanks Navdeep and Alfred.

That's correct. It's not touching the var if it's not set.

I guess the other TUNABLE_INT_FETCHs later in the function checking for
variable == 0 confused me, e.g. nmbclusters:

131     TUNABLE_INT_FETCH("kern.ipc.nmbclusters", &nmbclusters);
132     if (nmbclusters == 0)
133             nmbclusters = maxmbufmem / MCLBYTES / 4;

But those are global variables, so here we are just checking whether they
are explicitly set or not. If not, we will set them.

For maxmbufmem, we will set it to 1/2 of realmem, and if the user sets it
explicitly then we will make sure it's not more than 3/4 of realmem.

Yes.  It's somewhat confusing.

I'm all for adding comments to this effect if you have the time and 
inclination.
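
Something along these lines might serve as a starting point for such comments. This is a standalone sketch, not the kernel function: size_maxmbufmem() and size_nmbclusters() are made-up names, and "user_value == 0" stands in for "tunable not set", which is a simplification of what the fetch macros do.

```c
#include <assert.h>

/*
 * Model of the maxmbufmem rule: default to half of realmem, let an
 * explicit setting override that, then cap the result at 3/4 of realmem.
 * user_value == 0 is used here to mean "not set by the administrator".
 */
long long
size_maxmbufmem(long long realmem, long long user_value)
{
	long long maxmbufmem;

	assert(realmem > 0);
	maxmbufmem = realmem / 2;		/* default: 1/2 of realmem */
	if (user_value != 0)
		maxmbufmem = user_value;	/* explicit setting wins... */
	if (maxmbufmem > realmem / 4 * 3)
		maxmbufmem = realmem / 4 * 3;	/* ...but is capped at 3/4 */
	return (maxmbufmem);
}

/*
 * Model of the nmbclusters rule: the global starts out as 0, so a value
 * of 0 after the fetch means "not set" and a default is derived from
 * maxmbufmem instead.
 */
int
size_nmbclusters(int user_value, long long maxmbufmem, int mclbytes)
{
	assert(mclbytes > 0);
	if (user_value != 0)
		return (user_value);
	return ((int)(maxmbufmem / mclbytes / 4));
}
```

The two functions differ in exactly the way that caused the confusion: one variable has a non-zero default assigned before the fetch, the other relies on zero-initialised global storage.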


-Alfred



Re: LOCAL_CREDS are broken ?

2013-08-29 Thread Alfred Perlstein

On 8/29/13 11:48 AM, Yuri wrote:

The example below breaks with "Protocol not available"
But what is wrong? Isn't this the correct usage?
LOCAL_CREDS are only handled in kern/uipc_usrreq.c for AF_LOCAL, so it 
isn't clear why this doesn't work.


Yuri



--- example.c ---
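/* The header names were lost in the archive; these are the ones this code needs. */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int sock;
  int error;
  int oval = 1;

  error = socket(AF_LOCAL, SOCK_SEQPACKET, 0);
  if (error == -1) {perror("socket"); exit(-1);}
  sock = error;

  error = setsockopt(sock, SOL_SOCKET, LOCAL_CREDS, &oval, sizeof(oval));
  if (error) {perror("setsockopt"); exit(-1);}
  return 0;
}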



Looks like SOCK_SEQPACKET doesn't support LOCAL_CREDS because its 
protosw doesn't contain the entry for:

.pr_ctloutput = &uipc_ctloutput,

Have a look at src/sys/kern/uipc_usrreq.c at around lines 280-332:


static struct protosw localsw[] = {
{
        .pr_type =      SOCK_STREAM,
        .pr_domain =    &localdomain,
        .pr_flags =     PR_CONNREQUIRED|PR_WANTRCVD|PR_RIGHTS,
        .pr_ctloutput = &uipc_ctloutput,
        .pr_usrreqs =   &uipc_usrreqs_stream
},
{
        .pr_type =      SOCK_DGRAM,
        .pr_domain =    &localdomain,
        .pr_flags =     PR_ATOMIC|PR_ADDR|PR_RIGHTS,
        .pr_ctloutput = &uipc_ctloutput,
        .pr_usrreqs =   &uipc_usrreqs_dgram
},
{
        .pr_type =      SOCK_SEQPACKET,
        .pr_domain =    &localdomain,

        /*
         * XXXRW: For now, PR_ADDR because soreceive will bump into them
         * due to our use of sbappendaddr.  A new sbappend variants is needed
         * that supports both atomic record writes and control data.
         */
        .pr_flags =     PR_ADDR|PR_ATOMIC|PR_CONNREQUIRED|PR_WANTRCVD|
                        PR_RIGHTS,
        .pr_usrreqs =   &uipc_usrreqs_seqpacket,
},
};


I wonder if this is just a bug/missing code!?
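
To make the suspicion concrete, here is a toy model of how a missing handler would surface as ENOPROTOOPT ("Protocol not available"). Everything prefixed toy_ is invented for the sketch; it is not the kernel's struct protosw or its setsockopt path, and I have not verified that the real dispatch takes this route for LOCAL_CREDS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a protocol switch entry; only the field we care about. */
struct toy_protosw {
	const char *name;
	int (*ctloutput)(int optname, int optval);	/* may be NULL */
};

/* Toy handler that accepts every option it is given. */
int
toy_uipc_ctloutput(int optname, int optval)
{
	(void)optname;
	(void)optval;
	return (0);
}

/* Protocol-level options are forwarded to the handler, if there is one. */
int
toy_setsockopt(const struct toy_protosw *pr, int optname, int optval)
{
	assert(pr != NULL);
	if (pr->ctloutput == NULL)
		return (ENOPROTOOPT);	/* nobody to hand the option to */
	return (pr->ctloutput(optname, optval));
}

/* The stream entry has a handler; the seqpacket entry, as quoted, does not. */
const struct toy_protosw toy_stream = { "stream", toy_uipc_ctloutput };
const struct toy_protosw toy_seqpacket = { "seqpacket", NULL };
```

In this model the same option succeeds on toy_stream and fails on toy_seqpacket purely because of the NULL function pointer, which is the shape of the bug being suggested.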

-Alfred







--
Alfred Perlstein



Re: [rfc] migrate lagg to an rmlock

2013-08-24 Thread Alfred Perlstein

On 8/24/13 10:47 AM, Robert N. M. Watson wrote:

On 24 Aug 2013, at 17:36, Alfred Perlstein wrote:


We should distinguish "lock contention" from "line contention". When acquiring a rwlock 
on multiple CPUs concurrently, the cache lines used to implement the lock are contended, as they must bounce 
between caches via the cache coherence protocol, also referred to as "contention".  In the if_lagg 
code, I assume that the read-only acquire of the rwlock (and perhaps now rmlock) is for data stability rather 
than mutual exclusion -- e.g., to allow processing to completion against a stable version of the lagg 
configuration. As such, indeed, there should be no lock contention unless a configuration update takes place, 
and any line contention is a property of the locking primitive rather than data model.

There are a number of other places in the kernel where migration to an rmlock 
makes sense -- however, some care must be taken for four reasons: (1) while 
read locks don't experience line contention, write locking becomes observably 
more expensive so is not suitable for all rwlock line contention spots -- 
e.g., rmlocks might not be suitable for tcbinfo; (2) rmlocks, unlike rwlocks, 
implement reader priority propagation, so you must reason about; and (3) 
historically, rmlocks have not fully implemented WITNESS so you may get less 
good debugging output.  if_lagg is a nice place to use rmlocks, as 
reconfigurations are very rare, and it's really all about long-term data 
stability.

Robert, what do you think about a quick swap of the ifnet structures to counter 
before 10.x?

Could you be more specific about the proposal you're making?

Robert


The lagg patch referred to in the thread seems to indicate that zero 
locking is needed if we just switched to counter(9). That makes me 
wonder if we could do better with locking in other places if we switched 
to counter(9) while we have the chance.


This is the thread:

http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html

> Perfect solution would be to convert ifnet(9) to counters(9), but this
> requires much more work, and unfortunately ABI change, so temporarily
> patch lagg(4) manually.
>
> We store counters in the softc, and once per second push their values
> to legacy ifnet counters.



--
Alfred Perlstein



Re: [rfc] migrate lagg to an rmlock

2013-08-24 Thread Alfred Perlstein

On 8/24/13 7:16 AM, Robert Watson wrote:

On Sat, 24 Aug 2013, Alexander V. Chernikov wrote:


On 24.08.2013 00:54, Adrian Chadd wrote:


I'd like to commit this to -10. It migrates the if_lagg locking
from a rw lock to a rm lock. We see a bit of contention between the
transmit and


We're running lagg with rmlock on several hundred heavily loaded 
machines, it really works better. However, there should not be any 
contention between receive and transmit side since there is actually 
no _real_ need to lock RX (and even use lagg receive code at all):


http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html


We should distinguish "lock contention" from "line contention". When 
acquiring a rwlock on multiple CPUs concurrently, the cache lines used 
to implement the lock are contended, as they must bounce between 
caches via the cache coherence protocol, also referred to as 
"contention".  In the if_lagg code, I assume that the read-only 
acquire of the rwlock (and perhaps now rmlock) is for data stability 
rather than mutual exclusion -- e.g., to allow processing to 
completion against a stable version of the lagg configuration. As 
such, indeed, there should be no lock contention unless a 
configuration update takes place, and any line contention is a 
property of the locking primitive rather than data model.


There are a number of other places in the kernel where migration to an 
rmlock makes sense -- however, some care must be taken for four 
reasons: (1) while read locks don't experience line contention, write 
locking becomes observably more expensive so is not suitable for all 
rwlock line contention spots -- e.g., rmlocks might not be suitable for 
tcbinfo; (2) rmlocks, unlike rwlocks, implement reader priority 
propagation, so you must reason about; and (3) historically, rmlocks 
have not fully implemented WITNESS so you may get less good debugging 
output.  if_lagg is a nice place to use rmlocks, as reconfigurations 
are very rare, and it's really all about long-term data stability.


Robert



Robert, what do you think about a quick swap of the ifnet structures to 
counter before 10.x?


-Alfred

--
Alfred Perlstein



Re: Making IB a first class citizen.

2013-08-23 Thread Alfred Perlstein

On 8/23/13 2:29 PM, Vijay Singh wrote:

We've been running with this change at work for some time and it doesn't seem 
to be impacting performance at all. We have a statically routed environment 
though. Also if we really want to optimize for performance wrt routing then 
IMHO we need to bring back route caching to the tcpcb. Just a thought.

Thanks Vijay, I'll give a little more time and then push this change in.

-Alfred



Sent from my iPhone

On Aug 23, 2013, at 1:52 PM, Adrian Chadd  wrote:


.. should just check to see what impact it has on performance in the
general case. that may change the cache behaviour of the ARP / routing
table code.



-adrian



On 23 August 2013 09:50, Alfred Perlstein  wrote:


Hello -net.

This email is about making Infiniband a first class citizen of the FreeBSD
kernel.

Right now we have one #ifdef OFED in the src tree that makes compiling
modules a real challenge:

In sys/net/if_llatbl.h the size of "struct llentry" changes depending on
whether OFED is compiled in, but only by 16 bytes, because InfiniBand uses
20 bytes for the MAC.  I am wondering if it would be OK to just unifdef this
part to make InfiniBand a first class citizen of the kernel. Otherwise maybe
we can reverse the ifdef so that it's WITHOUT_OFED and have it on by default.

I understand that we can not do this for FreeBSD 9.x due to breaking
network ABI, however I think we still have time to do so in FreeBSD 10.x.

If there's no objection I'd like to push this change into head in the next
day or two.  The only difference is +16 bytes to the "struct llentry".

Comments?



Making IB a first class citizen.

2013-08-23 Thread Alfred Perlstein

Hello -net.

This email is about making Infiniband a first class citizen of the 
FreeBSD kernel.


Right now we have one #ifdef OFED in the src tree that makes compiling 
modules a real challenge:


In sys/net/if_llatbl.h the size of "struct llentry" changes depending on 
whether OFED is compiled in, but only by 16 bytes, because InfiniBand uses 
20 bytes for the MAC.  I am wondering if it would be OK to just unifdef this 
part to make InfiniBand a first class citizen of the kernel. Otherwise 
maybe we can reverse the ifdef so that it's WITHOUT_OFED and have it on 
by default.


I understand that we can not do this for FreeBSD 9.x due to breaking 
network ABI, however I think we still have time to do so in FreeBSD 10.x.


If there's no objection I'd like to push this change into head in the 
next day or two.  The only difference is +16 bytes to the "struct llentry".
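
To show where a figure like +16 can come from, here is a sketch that assumes the link-layer address is kept in a union next to an 8-byte aligned member. That layout is my assumption for the example; the toy_ types are not copied from if_llatbl.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout without the InfiniBand member: room for a 6-byte MAC. */
union toy_lladdr_small {
	uint64_t mac_aligned;	/* forces the union's alignment */
	uint16_t mac16[3];	/* 6-byte Ethernet address */
};

/* Same union with a 20-byte member added for InfiniBand. */
union toy_lladdr_ib {
	uint64_t mac_aligned;
	uint16_t mac16[3];
	uint8_t  mac8[20];	/* 20-byte InfiniBand address */
};

/* Number of bytes the 20-byte member adds to the union. */
size_t
toy_lladdr_growth(void)
{
	assert(sizeof(union toy_lladdr_ib) >= 20);
	return (sizeof(union toy_lladdr_ib) - sizeof(union toy_lladdr_small));
}
```

On a typical 64-bit ABI I would expect the two unions to be 8 and 24 bytes, so the growth is 16 even though the address itself only got 14 bytes longer; the rest is alignment padding.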


Comments?



Re: 9-STABLE: Chelsio t4nex0: failed to pre-process config file: 2.

2013-06-02 Thread Alfred Perlstein
This looks like the result of forgetting to include the actual firmware 
in the kernel config and/or the firmware device itself.


Can you check if you've included all the needed extra modules in the 
kernel config such as firmware(4) and the module for the card firmware 
itself?


A trick you can use is to run "kldstat" after loading the module; it will 
show which additional modules were needed for the device to work.  
Unfortunately the kernel can't autoload those modules while booting.


I'm not sure if loader(8) picks up the deps either.

-Alfred


On 6/2/13 6:22 PM, John wrote:

Hi Folks,

I have a pair of Chelsio T4 cards installed in a new HP DL380
system. The driver does not load at boot time, failing with the
message:

t4nex0: failed to pre-process config file: 2.

After the system has finished booting, if I then issue a
'kldload if_cxgbe' command, the driver loads correctly. Note,
the driver loads correctly from the command prompt with or
without the if_cxgbe_load in /boot/loader.conf.

The message is coming from t4_main.c:partition_resources().
I don't see anything obvious that would cause this:

        rc = cfg ? upload_config_file(sc, cfg, &mtype, &maddr) : ENOENT;
        if (rc != 0) {
                mtype = FW_MEMTYPE_CF_FLASH;
                maddr = t4_flash_cfg_addr(sc);
        }

        bzero(&caps, sizeof(caps));

        caps.op_to_write = htobe32(V_FW_CMD_OP(FW_CAPS_CONFIG_CMD) |
            F_FW_CMD_REQUEST | F_FW_CMD_READ);
        caps.cfvalid_to_len16 = htobe32(F_FW_CAPS_CONFIG_CMD_CFVALID |
            V_FW_CAPS_CONFIG_CMD_MEMTYPE_CF(mtype) |
            V_FW_CAPS_CONFIG_CMD_MEMADDR64K_CF(maddr >> 16) | FW_LEN16(caps));
        rc = -t4_wr_mbox(sc, sc->mbox, &caps, sizeof(caps), &caps);
        if (rc != 0) {
                device_printf(sc->dev,
                    "failed to pre-process config file: %d.\n", rc);
                return (rc);
        }

Has anyone run into this?

Thanks,
John

ps: And the output from loading the driver module by hand:

t4nex0:  mem 0xf7cc-0xf7cf,0xf700-0xf77f,0xf6ff-0xf6ff1fff irq 26 at device 0.4 on pci7
t4nex0: installing firmware 1.8.4.0 on card.
cxgbe0:  on t4nex0
cxgbe0: Ethernet address: 00:07:43:11:e9:00
cxgbe0: 16 txq, 8 rxq
cxgbe1:  on t4nex0
cxgbe1: Ethernet address: 00:07:43:11:e9:08
cxgbe1: 16 txq, 8 rxq
cxgbe2:  on t4nex0
cxgbe2: Ethernet address: 00:07:43:11:e9:10
cxgbe2: 16 txq, 8 rxq
cxgbe3:  on t4nex0
cxgbe3: Ethernet address: 00:07:43:11:e9:18
cxgbe3: 16 txq, 8 rxq
t4nex0: PCIe x8, 4 ports, 34 MSI-X interrupts, 101 eq, 33 iq
t4nex1:  mem 0xfbcc-0xfbcf,0xfb00-0xfb7f,0xfaff-0xfaff1fff irq 58 at device 0.4 on pci36
t4nex1: installing firmware 1.8.4.0 on card.
cxgbe4:  on t4nex1
cxgbe4: Ethernet address: 00:07:43:11:e6:a0
cxgbe4: 16 txq, 8 rxq
cxgbe5:  on t4nex1
cxgbe5: Ethernet address: 00:07:43:11:e6:a8
cxgbe5: 16 txq, 8 rxq
cxgbe6:  on t4nex1
cxgbe6: Ethernet address: 00:07:43:11:e6:b0
cxgbe6: 16 txq, 8 rxq
cxgbe7:  on t4nex1
cxgbe7: Ethernet address: 00:07:43:11:e6:b8
cxgbe7: 16 txq, 8 rxq
t4nex1: PCIe x8, 4 ports, 34 MSI-X interrupts, 101 eq, 33 iq






Re: Seeing EINVAL from writev on 8.0 to a non-blocking socket even though the data seems to hit the wire

2013-05-01 Thread Alfred Perlstein
On 5/1/13 8:03 PM, Richard Sharpe wrote:
> Hi folks,
>
> I am checking to see if there are any known bugs with respect to this
> in FreeBSD 8.0.
>
> Situation is that Samba 3.6.6 uses writev to a non-blocking socket to
> get the SMB2 requests on the wire.
>
> Intermittently, we see the writev return EINVAL even though the data
> has gotten on the wire. This I have verified by grabbing a capture and
> comparing the SMB Sequence number in the last outgoing packet on the
> wire vs the in-memory contents when we get EINVAL.
>
> Sometimes it occurs on a four-element IOVEC, sometimes we get EAGAIN
> on the four-element IOVEC and then we get EINVAL when retrying on a
> smaller IOVEC.
>
> Where should I look to check if there is some path where this might be
> happening? Is this even the correct mailing list?
>
What does the iovec look like when you get EINVAL? Can you sanity check
it? Is there anything special about it? (zero length vecs?)

I think there are a few "maxvals" that, if overrun, cause EINVAL to be
returned. An example is if your iovec is somehow huge or has many, many
elements.

-Alfred


Re: Is it possible to slow down the network interface?

2013-04-02 Thread Alfred Perlstein

On 4/2/13 4:25 PM, Yuri wrote:
For the testing purposes, I would like to be able to control the 
maximum speed of the interface.
There is this command 'ifconfig re0 media 10baseT/UTP' that is 
supposed to lower the speed to 10Mbps. However, it makes interface 
unusable on my system. All connections are broken, even the router had 
to be rebooted. Maybe this is the router issue.


Is there any other, "soft" way to change maximum interface speed to a 
particular value?
When somebody sends data too fast, OS sends back ICMP notifications 
that connection is jammed. My question is, is it possible to impose 
such condition artificially?
Is 'ifconfig re0 media 10baseT/UTP' actually supposed to work 
transparently, or disconnects are to be expected?




Try dummynet; it lets you simulate slow or otherwise special networks.

man 4 dummynet

-Alfred


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-11 Thread Alfred Perlstein

On 2/11/13 3:10 AM, Andre Oppermann wrote:

On 09.02.2013 15:41, Alfred Perlstein wrote:
However, the end result must be far different than what has occurred so far.

If the code was deemed unacceptable for general inclusion, then we must
find a way to provide a light framework to accomplish the needs of the
community member.


We've got pluggable congestion control modules thanks to lstewart.

You can implement any non-standard congestion control method by adding
your own module.  They can be compiled into the kernel or loaded as KLD.

I consider implementing this as a CC module the correct approach instead
of adding yet another sysctl.  Doing a CC module like this is very easy.


That sounds like a win.

-Alfred


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-09 Thread Alfred Perlstein

On 2/7/13 12:04 PM, George Neville-Neil wrote:

On Feb 6, 2013, at 12:28 , Alfred Perlstein  wrote:


On 2/6/13 4:46 AM, John Baldwin wrote:

On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote:

John:

A burst at line rate will *often* cause drops. This is because
router queues are of finite size. Also such a burst (especially
on a long delay bandwidth network) causes your RTT to increase even
if there is no drop which is going to hurt you as well.

A SHOULD in an RFC says you really really really really need to do it
unless there is some thing that makes you willing to override it. It is
slight wiggle room.

In this I agree with Andre, we should not be *not* doing it. Otherwise
folks will be turning this on and it is plain wrong. It may be fine
for your network but I would not want to see it in FreeBSD.

In my testing here at home I have put back into our stack max-burst. This
uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at
no more than 4 packets larger than your flight. All of my testing
high-bw-delay or lan has shown this to improve TCP performance. This
is because it helps you avoid bursting out so many packets that you overflow
a queue.
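
As a sketch of the clamp described in the paragraph above (the function name and the byte-based units are my own; this is not code from any stack): the congestion window is limited to the data in flight plus four segments.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAXBURST 4	/* segments allowed beyond what is in flight */

/*
 * Return cwnd limited to flight + TOY_MAXBURST * mss, all in bytes.
 * Arithmetic overflow is ignored; the sketch assumes realistic windows.
 */
uint32_t
toy_maxburst_clamp(uint32_t cwnd, uint32_t flight, uint32_t mss)
{
	uint32_t limit;

	assert(mss > 0);
	limit = flight + TOY_MAXBURST * mss;
	return (cwnd > limit ? limit : cwnd);
}
```

For example, with 10000 bytes in flight and a 1460-byte MSS, a 100000-byte window is cut to 15840 bytes, so at most four extra segments can leave back to back.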

In your long-delay bw link if you do burst out too many (and you never
know how many that is since you can not predict how full all those
MPLS queues are or how big they are) you will really hurt yourself even worse.
Note that generally in Cisco routers the default queue size is somewhere between
100-300 packets depending on the router.

Due to the way our application works this never happens, but I am fine with
just keeping this patch private.  If there are other shops that need this they
can always dig the patch up from the archives.


This is yet another time when I'm sad about how things happen in FreeBSD.

A developer, specifically one who contributes much time and $$$ to the 
project, comes forward with a non-default option that's very useful for some 
specific workloads, and the community rejects the patches even though it's 
been successful in other OSes.

It makes zero sense.

John, can you repost the patch?  Maybe there is a way to refactor this somehow 
so it's like accept filters where we can plug in a hook for TCP?

I am very disappointed, but not surprised.


I take away the complete opposite feeling.  This is how we work through these
issues.  It's clear from the discussion that this need not be a default in the
system, and is a special case.  We had a reasoned discussion of what would be
best to do and at least two experts in TCP weighed in on the effect this change
might have.

Not everything proposed by a developer need go into the tree, in particular
since these discussions are archived we can always revisit this later.

This is exactly how collaborative development should look, whether or not the
patch is integrated now, next week, next year, or ever.


I agree that discussion is great, we have all learned quite a bit from 
it, about TCP and the dangers of adjusting buffering without 
considerable thought.  I would not be involved in FreeBSD had this type 
of discussion and information not been shared on the lists so readily.


However, the end result must be far different than what has occurred so far.

If the code was deemed unacceptable for general inclusion, then we must 
find a way to provide a light framework to accomplish the needs of the 
community member.


Take for instance someone who is starting a company that needs this 
facility.  Which OS will they choose?  One who has integrated a useful 
feature?  Or one who has rejected it and left that code in the mailing 
list archives?


As much as expert opinion is valuable, it must include an understanding of 
the need to handle special cases and the ability to facilitate those 
special cases for our users and developers.


-Alfred


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-06 Thread Alfred Perlstein

On 2/6/13 4:46 AM, John Baldwin wrote:

On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote:

John:

A burst at line rate will *often* cause drops. This is because
router queues are of finite size. Also such a burst (especially
on a long delay bandwidth network) causes your RTT to increase even
if there is no drop which is going to hurt you as well.

A SHOULD in an RFC says you really really really really need to do it
unless there is some thing that makes you willing to override it. It is
slight wiggle room.

In this I agree with Andre, we should not be *not* doing it. Otherwise
folks will be turning this on and it is plain wrong. It may be fine
for your network but I would not want to see it in FreeBSD.

In my testing here at home I have put back into our stack max-burst. This
uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at
no more than 4 packets larger than your flight. All of my testing
high-bw-delay or lan has shown this to improve TCP performance. This
is because it helps you avoid bursting out so many packets that you overflow
a queue.

In your long-delay bw link if you do burst out too many (and you never
know how many that is since you can not predict how full all those
MPLS queues are or how big they are) you will really hurt yourself even worse.
Note that generally in Cisco routers the default queue size is somewhere between
100-300 packets depending on the router.

Due to the way our application works this never happens, but I am fine with
just keeping this patch private.  If there are other shops that need this they
can always dig the patch up from the archives.


This is yet another time when I'm sad about how things happen in FreeBSD.

A developer, specifically one who contributes much time and $$$ to the 
project, comes forward with a non-default option that's very useful for 
some specific workloads, and the community rejects the patches even though 
it's been successful in other OSes.


It makes zero sense.

John, can you repost the patch?  Maybe there is a way to refactor this 
somehow so it's like accept filters where we can plug in a hook for TCP?


I am very disappointed, but not surprised.

-Alfred




Re: m_get2() name

2013-02-01 Thread Alfred Perlstein

On 2/1/13 7:04 AM, Gleb Smirnoff wrote:

   Hi!

   The m_get2() function allocates a single mbuf with enough space
to hold a specified amount of data. It can return either a single mbuf,
an mbuf with a standard cluster, page size cluster, or jumbo cluster.

   It is already utilized in pfsync, bpf, libalias and soon to be utilized
in ieee80211. There are probably more places in the stack where it can be used.

   The question is about its name. Once introduced, I just gave it the name
"m_get2" to avoid a discussion with myself about bikeshed colour and continue
hacking. Now it is getting used more widely, and before we branch any stable
branch off the head, we have a last chance to rename it to something more
meaningful.

   Any ideas on better name are welcome.


m_getbs - mbuf get buffer size.

conveniently also maps to:

m_getbs - mbuf get bike shed.

This is a cool function.  Maybe it should take an int*error arg as well 
for ENOBUFS/EINVAL?
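
Roughly what I have in mind, as a standalone sketch. Nothing here is the real m_get2(): the toy_ names are invented, the size limits are placeholder values rather than the kernel's constants, and no memory is allocated, so the ENOBUFS path is only described in a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder limits for the sketch; the real ones live in kernel headers. */
enum {
	TOY_MLEN = 224,		/* assumed payload of a plain mbuf */
	TOY_MCLBYTES = 2048,	/* assumed standard cluster size */
	TOY_PAGESZ = 4096,	/* assumed page-size cluster */
	TOY_JUMBO = 9216	/* assumed jumbo cluster size */
};

enum toy_store {
	TOY_NONE, TOY_MBUF, TOY_CLUSTER, TOY_PAGE_CLUSTER, TOY_JUMBO_CLUSTER
};

/*
 * Pick the smallest storage class that holds 'size' bytes and report
 * failures through *error.  A real allocator would also set ENOBUFS
 * when the chosen zone has no memory left.
 */
enum toy_store
toy_pick_storage(int size, int *error)
{
	assert(error != NULL);
	*error = 0;
	if (size < 0 || size > TOY_JUMBO) {
		*error = EINVAL;	/* no single buffer can hold this */
		return (TOY_NONE);
	}
	if (size <= TOY_MLEN)
		return (TOY_MBUF);
	if (size <= TOY_MCLBYTES)
		return (TOY_CLUSTER);
	if (size <= TOY_PAGESZ)
		return (TOY_PAGE_CLUSTER);
	return (TOY_JUMBO_CLUSTER);
}
```

With the error argument a caller can tell "you asked for something impossible" apart from "try again later", which a bare NULL return cannot express.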


-Alfred


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-30 Thread Alfred Perlstein

On 1/30/13 12:29 PM, Andre Oppermann wrote:

On 30.01.2013 18:11, Alfred Perlstein wrote:

On 1/30/13 11:58 AM, John Baldwin wrote:

On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote:


Yes, unfortunately I do object.  This option, combined with the inflated
CWND at the end of a burst, effectively removes much, if not all, of the
congestion control mechanisms originally put in place to allow multiple
[TCP] streams co-exist on the same pipe.  Not having any decay or timeout
makes it even worse by doing this burst after an arbitrary amount of time
when network conditions and the congestion situation have certainly changed.

You have completely ignored the fact that Linux has had this as a global
option for years and the Internet has not melted.  A socket option is far more
fine-grained than their tunable (and requires code changes, not something a
random sysadmin can just toggle as "tuning").


I agree with John here.

While Andre's objection makes sense, since the majority of Linux/Unix
hosts now have this as a global option I can't think of why you would
force FreeBSD to be a final holdout.

Unless OpenBSD, NetBSD, Solaris/illumos also support this it is hardly a
majority of Linux/Unix hosts.  And this isn't something a "sysadmin"
should tune at all.

My apologies, I should have been more clear.  I was speaking of majority 
of install base, not majority of distros.


-Alfred


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-30 Thread Alfred Perlstein

On 1/30/13 11:58 AM, John Baldwin wrote:

On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote:


Yes, unfortunately I do object.  This option, combined with the inflated
CWND at the end of a burst, effectively removes much, if not all, of the
congestion control mechanisms originally put in place to allow multiple
[TCP] streams co-exist on the same pipe.  Not having any decay or timeout
makes it even worse by doing this burst after an arbitrary amount of time
when network conditions and the congestion situation have certainly changed.

You have completely ignored the fact that Linux has had this as a global
option for years and the Internet has not melted.  A socket option is far more
fine-grained than their tunable (and requires code changes, not something a
random sysadmin can just toggle as "tuning").


I agree with John here.

While Andre's objection makes sense, since the majority of Linux/Unix 
hosts now have this as a global option I can't think of why you would 
force FreeBSD to be a final holdout.


-Alfred
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-24 Thread Alfred Perlstein

On 1/24/13 11:14 AM, John Baldwin wrote:

On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote:

On 24.01.2013 03:31, Sepherosa Ziehau wrote:

On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin  wrote:

On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote:

On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin  wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were
seeing across a link with a high bandwidth-delay product (both high bandwidth
and high RTT).  Our specific use case was to use a TCP connection to reliably
forward a latency-sensitive datagram stream across a WAN connection.  We would
often see spikes in the latency of individual datagrams.  I eventually tracked
this down to the connection entering slow start when it would transmit data
after being idle.  The data stream was quite bursty and would often attempt to
transmit a burst of data after being idle for far longer than a retransmit
timeout.

In 7.x we had worked around this in the past by disabling RFC 3390 and jacking
the slow start window size up via a sysctl.  On 8.x this no longer worked.
The solution I came up with was to add a new socket option to disable idle
handling completely.  That is, when an idle connection restarts with this new
option enabled, it keeps its current congestion window and doesn't enter slow
start.

There are only a few cases where such an option is useful, but if anyone else
thinks this might be useful I'd be happy to add the option to FreeBSD.

I think what you need is RFC 2861; however, you should probably
ignore the "application-limited period" part of RFC 2861.

Hummm.  It appears btw, that Linux uses RFC 2861, but has a global knob to
disable it due to applications having problems.  When it is disabled,
it doesn't decay the congestion window at all during idle handling.  That is,
it appears to act the same as if TCP_IGNOREIDLE were enabled.

  From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html:

 tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
     If enabled, provide RFC 2861 behavior and time out the congestion
     window after an idle period.  An idle period is defined as the
     current RTO (retransmission timeout).  If disabled, the congestion
     window will not be timed out after an idle period.
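
On a Linux host the current setting can be read back through procfs. A
minimal sketch, assuming the usual /proc/sys mapping of the sysctl name;
the helper name ex_read_slow_start_after_idle is made up for illustration:

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 or 0 for the current value of
 * net.ipv4.tcp_slow_start_after_idle, or -1 if it cannot be read
 * (for example on a system that is not Linux).
 */
static int
ex_read_slow_start_after_idle(void)
{
	FILE *fp;
	int value;

	fp = fopen("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "r");
	if (fp == NULL)
		return (-1);
	if (fscanf(fp, "%d", &value) != 1) {
		fclose(fp);
		return (-1);
	}
	fclose(fp);
	/* The knob is a boolean, so collapse any non-zero value to 1. */
	return (value != 0 ? 1 : 0);
}
```

A result of 0 would mean the host keeps its congestion window across idle
periods, which is the behavior the proposed socket option asks for on a
single connection.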

Also, in this thread on tcp-m it appears no one on that list realizes that
there are any implementations which follow the "SHOULD" in RFC 2581 for idle
handling (which is what we do currently):

Nah, I don't think the idle detection in FreeBSD follows
RFC2581/RFC5681 4.1 (the paragraph before the "SHOULD").  IMHO, that's
probably why the author of the following email asked again about
implementations of the "SHOULD" in RFC2581/RFC5681.


http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html

So if we were to implement RFC 2861, the new socket option would be equivalent
to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket
basis rather than globally.

Agreed, a per-socket option could be more useful than global sysctls in
certain situations.  However, in addition to the per-socket option,
could global sysctl nodes to disable idle_restart/idle_cwv help too?

No.  This is far too dangerous once it makes it into some tuning guide.
The threat of congestion breakdown is real.  The Internet, or any packet
network, can only survive in the long term if almost all follow the rules
and self-constrain to remain fair to the others.  What would happen if
nobody would respect the traffic lights anymore?

The problem with this argument is Linux has already had this as a tunable
option for years and the Internet hasn't melted as a result.
  

Besides that bursting into unknown network conditions is very likely to
result in burst losses as well.  TCP isn't good at recovering from it.
In the end you most likely come out ahead if you decay the restartCWND.

We have two cases primarily: a) long distance, medium to high RTT, and
wildly varying bandwidth (a.k.a. the Internet); b) short distance, low
RTT and mostly plenty of bandwidth (a.k.a. Datacenter).  The former
absolutely, definitely requires a decayed restartCWND.  The latter less
so, but even there bursting at 10Gig TSO-assisted wirespeed isn't going
to end happily more often than not.

You forgot my case: c) dedicated long distance links with high bandwidth.


Since this seems to be a burning issue I'll come up with a patch in the
next days to add a decaying restartCWND that'll be fair and allow a very
quick ramp up if no loss occurs.

I think this could be useful.  OTOH, I still think the TCP_IGNOREIDLE option
is useful both with and without a decaying restartCWND?

Linux seems to be doing just fine with it for what seems to be a long 
while.  Can we get this committed?


-Alfred

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-22 Thread Alfred Perlstein

On 1/22/13 12:11 PM, John Baldwin wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were
seeing across a link with a high bandwidth-delay product (both high bandwidth
and high RTT).  Our specific use case was to use a TCP connection to reliably
forward a latency-sensitive datagram stream across a WAN connection.  We would
often see spikes in the latency of individual datagrams.  I eventually tracked
this down to the connection entering slow start when it would transmit data
after being idle.  The data stream was quite bursty and would often attempt to
transmit a burst of data after being idle for far longer than a retransmit
timeout.

In 7.x we had worked around this in the past by disabling RFC 3390 and jacking
the slow start window size up via a sysctl.  On 8.x this no longer worked.
The solution I came up with was to add a new socket option to disable idle
handling completely.  That is, when an idle connection restarts with this new
option enabled, it keeps its current congestion window and doesn't enter slow
start.

There are only a few cases where such an option is useful, but if anyone else
thinks this might be useful I'd be happy to add the option to FreeBSD.


This looks good, but it almost sounds like a bug for TCP to be doing 
this anyhow.


Why would one want this behavior?

Wouldn't it make sense to keep the window large until there was a 
problem rather than unconditionally chop it down?  I almost think TCP is 
afraid that you might wind up swapping out a 10gig interface for a 
modem?  I'm just not getting it.  (probably simple oversight on my part).


What do you think about also making this a sysctl for global on/off by 
default?
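
For comparison with a global knob, the per-socket form from the patch would
be used from an application roughly like this. It is only a sketch:
TCP_IGNOREIDLE exists only with headers and a kernel built from the patch
in this thread, and the helper name ex_set_ignoreidle is made up.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <errno.h>

/*
 * Hypothetical helper: ask the stack to keep the congestion window across
 * idle periods on one socket.  Without the patched headers the call is
 * compiled out and the helper reports ENOPROTOOPT instead.
 */
static int
ex_set_ignoreidle(int fd, int on)
{
#ifdef TCP_IGNOREIDLE
	return (setsockopt(fd, IPPROTO_TCP, TCP_IGNOREIDLE, &on, sizeof(on)));
#else
	(void)fd;
	(void)on;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

The point of the per-socket form is that only the application that knows
its traffic is bursty opts in; everything else on the host keeps the
default idle handling.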


-Alfred



Index: share/man/man4/tcp.4
===
--- share/man/man4/tcp.4    (revision 245742)
+++ share/man/man4/tcp.4    (working copy)
@@ -205,6 +205,18 @@
  in the
  .Sx MIB Variables
  section further down.
+.It Dv TCP_IGNOREIDLE
+If a TCP connection is idle for more than one retransmit timeout,
+it enters slow start when new data is available to transmit.
+This avoids flooding the network with a full window of traffic at line rate.
+It also allows the connection to adjust to changes to network conditions
+that occurred while the connection was idle.  A connection that sends
+bursts of data separated by large idle periods can be permanently stuck in
+slow start as a result.
+The boolean option
+.Dv TCP_IGNOREIDLE
+disables the idle connection handling allowing connections to maintain the
+existing congestion window when restarting after an idle period.
  .It Dv TCP_NODELAY
  Under most circumstances,
  .Tn TCP
Index: sys/netinet/tcp_var.h
===
--- sys/netinet/tcp_var.h   (revision 245742)
+++ sys/netinet/tcp_var.h   (working copy)
@@ -230,6 +230,7 @@
  #define   TF_NEEDFIN      0x000800    /* send FIN (implicit state) */
  #define   TF_NOPUSH       0x001000    /* don't push */
  #define   TF_PREVVALID    0x002000    /* saved values for bad rxmit valid */
+#define    TF_IGNOREIDLE   0x004000    /* connection is never idle */
  #define   TF_MORETOCOME   0x01        /* More data to be appended to sock */
  #define   TF_LQ_OVERFLOW  0x02        /* listen queue overflow */
  #define   TF_LASTIDLE     0x04        /* connection was previously idle */
Index: sys/netinet/tcp_output.c
===
--- sys/netinet/tcp_output.c    (revision 245742)
+++ sys/netinet/tcp_output.c    (working copy)
@@ -206,7 +206,8 @@
         * to send, then transmit; otherwise, investigate further.
         */
        idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
-       if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
+       if (!(tp->t_flags & TF_IGNOREIDLE) &&
+           idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
                cc_after_idle(tp);
        tp->t_flags &= ~TF_LASTIDLE;
        if (idle) {
Index: sys/netinet/tcp.h
===
--- sys/netinet/tcp.h   (revision 245823)
+++ sys/netinet/tcp.h   (working copy)
@@ -156,6 +156,7 @@
  #define   TCP_NODELAY     1   /* don't delay send to coalesce packets */
  #if __BSD_VISIBLE
  #define   TCP_MAXSEG      2   /* set maximum segment size */
+#define    TCP_IGNOREIDLE  3   /* disable idle connection handling */
  #define   TCP_NOPUSH      4   /* don't push last block of write */
  #define   TCP_NOOPT       8   /* don't use TCP options */
  #define   TCP_MD5SIG      16  /* use MD5 digests (RFC2385) */
Index: sys/netinet/tcp_usrreq.c
===
--- sys/netinet/tcp_usrreq.c    (revision 245742)
+++ sys/netinet/tcp_usrreq.c    (working copy)
@@ -1354,6 +1354,7 @@
  
  		case TCP_NODELAY:
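
The tcp_output.c hunk in this patch comes down to one extra test in front
of cc_after_idle(). A userland model of that decision, as a sketch only:
the struct, the function name and the EX_ prefix are stand-ins for
illustration, not kernel definitions.

```c
#include <stdbool.h>

#define EX_TF_IGNOREIDLE	0x004000	/* flag value taken from the patch */

/* Stand-in for the few tcpcb fields the check reads. */
struct ex_tcpcb {
	unsigned int	t_flags;
	int		t_rcvtime;	/* tick count when data was last received */
	int		t_rxtcur;	/* current retransmit timeout, in ticks */
};

/*
 * Returns true when the congestion window would be restarted: the
 * connection is idle, it has been quiet for at least one retransmit
 * timeout, and the socket has not opted out with the new flag.
 */
static bool
ex_restart_after_idle(const struct ex_tcpcb *tp, bool idle, int ticks)
{
	if (tp->t_flags & EX_TF_IGNOREIDLE)
		return (false);
	return (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur);
}
```

With the flag clear the model behaves like the unpatched code; with it set
the window is never restarted, no matter how long the connection was idle.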

   

Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

2013-01-14 Thread Alfred Perlstein

On 1/14/13 4:56 PM, John Baldwin wrote:

On Monday, January 14, 2013 4:42:16 pm Alfred Perlstein wrote:

Wouldn't a comment over the code suffice?

Something like your email as a header would actually work very nicely!

I think just using decimal would be more confusing than explicitly
calling it out like:

/* begin enumerated (not bitmask) socket option specifiers */
#define TCP_MAXSEG  0x02    /* set maximum segment size */
#define TCP_NOPUSH  0x04    /* don't push last block of write */
#define TCP_NOOPT   0x08    /* don't use TCP options */
#define TCP_MD5SIG  0x10    /* use MD5 digests (RFC2385) */
/* end enumerated socket option specifiers */

I have a patch I'll post next which will add a new option as '3'.  I think that
will make it more obvious and avoid having new options follow the old pattern.

Any objection to adding the contents of that email as a comment 
section?  It really would help.
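
To make the "enumerated, not bitmask" point concrete, here is a small
sketch. The numbers are the ones quoted in this thread, with 3 being the
new option John mentions; the EX_ names and the lookup function are made up
for illustration.

```c
#include <stddef.h>

/* Option numbers as quoted in the thread; EX_ marks them as illustrative. */
enum ex_tcp_opt {
	EX_TCP_NODELAY = 1,
	EX_TCP_MAXSEG = 2,
	EX_TCP_IGNOREIDLE = 3,	/* new option filling a gap */
	EX_TCP_NOPUSH = 4,
	EX_TCP_NOOPT = 8,
	EX_TCP_MD5SIG = 16
};

/*
 * Options are matched as whole values, the way a setsockopt handler
 * switches on optname; they are never tested bit by bit.
 */
static const char *
ex_tcp_optname(int optname)
{
	switch (optname) {
	case EX_TCP_NODELAY:
		return ("TCP_NODELAY");
	case EX_TCP_MAXSEG:
		return ("TCP_MAXSEG");
	case EX_TCP_IGNOREIDLE:
		return ("TCP_IGNOREIDLE");
	case EX_TCP_NOPUSH:
		return ("TCP_NOPUSH");
	case EX_TCP_NOOPT:
		return ("TCP_NOOPT");
	case EX_TCP_MD5SIG:
		return ("TCP_MD5SIG");
	default:
		return (NULL);
	}
}
```

Here 3 names its own option rather than "NODELAY plus MAXSEG", and 5 names
nothing at all, which is exactly what a hex bit pattern would not suggest
to a reader.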



-Alfred


Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks

2013-01-14 Thread Alfred Perlstein

Wouldn't a comment over the code suffice?

Something like your email as a header would actually work very nicely!

I think just using decimal would be more confusing than explicitly 
calling it out like:


/* begin enumerated (not bitmask) socket option specifiers */
#define TCP_MAXSEG  0x02    /* set maximum segment size */
#define TCP_NOPUSH  0x04    /* don't push last block of write */
#define TCP_NOOPT   0x08    /* don't use TCP options */
#define TCP_MD5SIG  0x10    /* use MD5 digests (RFC2385) */
/* end enumerated socket option specifiers */


On 1/14/13 3:50 PM, John Baldwin wrote:

The constants used for TCP and UDP socket options (TCP_NODELAY, etc.) are
currently defined as hex values that are individual bits.  However, socket
options are never masked together, they are used as a simple enumeration of
discrete values.  Using a bitmask forces us to run out of bits and makes it
harder for vendors to try to use a high range of values for local custom
options (hoping that they never conflict with a new option value added in
stock FreeBSD).

The socket options in <sys/socket.h> do use bitmasks for the low bits because
they map directly to bits in so_options, but then they start a simple enumeration
at 0x1000.  TCP and UDP socket options do not directly map to bits in a flags
field in the PCB (e.g. TF_NODELAY != TCP_NODELAY).  I would like to change the
representation of the constants to be decimal instead of hex and encourage new
options to fill in the gaps between the existing values.  This would preserve
the existing ABI but keep things more sane in the future (I believe).  The
diff is this:

Index: netinet/tcp.h
===
--- netinet/tcp.h   (revision 245225)
+++ netinet/tcp.h   (working copy)
@@ -151,18 +151,18 @@
  /*
   * User-settable options (used with setsockopt).
   */
-#define TCP_NODELAY     0x01    /* don't delay send to coalesce packets */
+#define TCP_NODELAY     1       /* don't delay send to coalesce packets */
  #if __BSD_VISIBLE
-#define TCP_MAXSEG      0x02    /* set maximum segment size */
-#define TCP_NOPUSH      0x04    /* don't push last block of write */
-#define TCP_NOOPT       0x08    /* don't use TCP options */
-#define TCP_MD5SIG      0x10    /* use MD5 digests (RFC2385) */
-#define TCP_INFO        0x20    /* retrieve tcp_info structure */
-#define TCP_CONGESTION  0x40    /* get/set congestion control algorithm */
-#define TCP_KEEPINIT    0x80    /* N, time to establish connection */
-#define TCP_KEEPIDLE    0x100   /* L,N,X start keeplives after this period */
-#define TCP_KEEPINTVL   0x200   /* L,N interval between keepalives */
-#define TCP_KEEPCNT     0x400   /* L,N number of keepalives before close */
+#define TCP_MAXSEG      2       /* set maximum segment size */
+#define TCP_NOPUSH      4       /* don't push last block of write */
+#define TCP_NOOPT       8       /* don't use TCP options */
+#define TCP_MD5SIG      16      /* use MD5 digests (RFC2385) */
+#define TCP_INFO        32      /* retrieve tcp_info structure */
+#define TCP_CONGESTION  64      /* get/set congestion control algorithm */
+#define TCP_KEEPINIT    128     /* N, time to establish connection */
+#define TCP_KEEPIDLE    256     /* L,N,X start keeplives after this period */
+#define TCP_KEEPINTVL   512     /* L,N interval between keepalives */
+#define TCP_KEEPCNT     1024    /* L,N number of keepalives before close */
  
  #define	TCP_CA_NAME_MAX	16	/* max congestion control name length */
  
Index: netinet/udp.h

===
--- netinet/udp.h   (revision 245225)
+++ netinet/udp.h   (working copy)
@@ -48,7 +48,7 @@
  /*
   * User-settable options (used with setsockopt).
   */
-#define UDP_ENCAP       0x01
+#define UDP_ENCAP       1
  
  
  /*






Re: FreeBSD boxes as a 'router'...

2012-11-20 Thread Alfred Perlstein

On 11/20/12 3:30 PM, Barney Cordoba wrote:


--- On Tue, 11/20/12, Ingo Flaschberger  wrote:


From: Ingo Flaschberger 
Subject: Re: FreeBSD boxes as a 'router'...
To: freebsd-net@freebsd.org
Date: Tuesday, November 20, 2012, 6:04 PM
On 20.11.2012 23:49, Alfred Perlstein wrote:

On 11/20/12 2:42 PM, Jim Thompson wrote:

On Nov 20, 2012, at 3:52 PM, Barney Cordoba wrote:

You're entitled to your opinion, but experimental results have tended to
show yours incorrect.

Jim

Agree with Jim.  If you want pure packet performance you burn a core to
run a polling loop.

On new systems I had better performance and no live-locks without polling;
on old systems (Intel 82541GI) polling prevents live-locks.

Best test:
Loop a GigE switch, inject a packet and plug it into the test box.

Yeah, that's a good real-world test.

To me "performance" is not "burning a cpu" to get some extra pps.
Performance is not dropping buckets of packets. Performance is using
less cpu to do the same amount of work.

Is a machine that benchmarks at 998Mb/s at 95% cpu really a "higher
performance" system than one that does 970Mb/s and uses 50% of the cpu?

The measure of performance is to manage an entire load without dropping
any packets. If your machine goes into live-lock, then you need more
machine. Hacking it so that it drops packets is hardly a solution.

Any free CPU is wasted CPU.  (unless you're concerned about power 
consumption, then it's debatable).


-Alfred








Re: FreeBSD boxes as a 'router'...

2012-11-20 Thread Alfred Perlstein

On 11/20/12 2:42 PM, Jim Thompson wrote:

On Nov 20, 2012, at 3:52 PM, Barney Cordoba  wrote:


Anyone who even mentions polling should be discounted altogether. Polling
had value when you couldn't control the interrupt delays; but interrupt
moderation allows you to pace the interrupts any way you like without
the inefficiencies of polling.

You're entitled to your opinion, but experimental results have tended to show 
yours incorrect.

Jim

Agree with Jim.  If you want pure packet performance you burn a core to 
run a polling loop.


-Alfred


Re: [CFT] ipfw SMP-ready dynamic states

2012-11-13 Thread Alfred Perlstein

Alexander, this is awesome.

On 11/13/12 11:28 AM, Alexander V. Chernikov wrote:

Hello list!

Currently most ipfw operations with dynamic states (keep-state, 
check-state, limit) are serialized via IPFW_DYN_LOCK() which is 
per-vnet mutex lock.


As a result, performance is limited to the same ~650kpps as in routing
(in several cases).

Patch changes the following:
* global lock is changed to per-bucket mutex
* state expiration is done in ipfw_tick every 1s. No expiration is 
done on forwarding path
* hash table resize is done automatically and does not cause all 
states to be lost


The only (architectural) problem I see is unlocked V_dyn_count 
increments.

So, we can do the following:
1) lock increments/decrements via some separate mutex
2) do nothing
3) take some combined approach:

Generally, we don't need the value to be _exact_.
As a result, we count the total number of states in every ipfw_tick run 
and set V_dyn_count to the new value. New states still increment 
V_dyn_count unlocked.


What about using per-cpu PCPU counters, and then collecting them for 
display/reporting?
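
Roughly what I have in mind, as a userland sketch rather than kernel code.
The CPU count, the cache line size and all of the ex_ names are
assumptions; a real kernel version also has to keep the thread on its CPU
while it updates its slot.

```c
#include <stddef.h>

#define EX_NCPU		4	/* assumed CPU count for the sketch */
#define EX_CACHELINE	64	/* assumed cache line size in bytes */

/* One slot per CPU, padded so that two slots never share a cache line. */
struct ex_pcpu_counter {
	long	value;
	char	pad[EX_CACHELINE - sizeof(long)];
};

static struct ex_pcpu_counter ex_dyn_count[EX_NCPU];

/* Fast path: each CPU touches only its own slot, so no lock is taken. */
static void
ex_dyn_count_add(int cpu, long delta)
{
	ex_dyn_count[cpu].value += delta;
}

/* Reporting path: sum all slots; the total may be slightly stale. */
static long
ex_dyn_count_fetch(void)
{
	long total = 0;
	int i;

	for (i = 0; i < EX_NCPU; i++)
		total += ex_dyn_count[i].value;
	return (total);
}
```

The forwarding path then never contends on a shared counter, and the
slightly stale total is fine given that the value does not need to be
exact.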


-Alfred




Performance:

Synthetic traffic, ipfw with single allow ip from any to any rule: 2.4M.
single keep-state ip from any to any: 2.2M.

Some more tests should be run (with a large number of states, 
different types of traffic, etc), maybe I can do some next week.



You need to run recent -current or merge r242631 and r242834 before 
applying this patch.





Re: auto tuning tcp

2012-11-13 Thread Alfred Perlstein

On 11/13/12 12:25 AM, Andre Oppermann wrote:

On 13.11.2012 09:18, Alfred Perlstein wrote:

On 11/13/12 12:06 AM, Andre Oppermann wrote:

On 13.11.2012 07:45, Alfred Perlstein wrote:

If you are concerned about the space/time tradeoff I'm pretty happy 
with making it 1/2, 1/4th, 1/8th the size of maxsockets.  (smaller?)

Would that work better?


I'd go for 1/8 or even 1/16 with a lower bound of 512.  More than
that is excessive.


I'm OK with 1/8.  All I'm really going for is trying to make it 
somewhat better than 512 when un-tuned.
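
A sketch of the sizing rule we seem to be converging on, namely 1/8 of
maxsockets rounded up to a power of two with a floor of 512. The function
names are made up and this is plain userland C that assumes a 32-bit
unsigned int, not the eventual kernel code.

```c
/* Round v up to a power of two (v itself if it already is one). */
static unsigned int
ex_roundup_pow2(unsigned int v)
{
	unsigned int p = 1;

	/* Stop at the top bit so the shift cannot wrap around. */
	while (p < v && p < (1u << 31))
		p <<= 1;
	return (p);
}

/* Hash size = maxsockets / 8, as a power of two, never below 512. */
static unsigned int
ex_tcb_hashsize(unsigned int maxsockets)
{
	unsigned int size = ex_roundup_pow2(maxsockets / 8);

	return (size < 512 ? 512 : size);
}
```

For example, 250,000 sockets would give 32768 buckets, while a small box
with 1,000 sockets stays at the old default of 512.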


PS: Please note that my patch for mbuf and maxfiles tuning is not yet
in HEAD, it's still sitting in my tcp_workqueue branch.  I still have
to search for derived values that may get totally out of whack with
the new scaling scheme.


This is cool!  Thank you for the feedback.

Would you like me to put this on a user branch somewhere for you to 
merge into your perf branch?


I can put it into my branch and also merge it to HEAD with
a "Submitted by: alfred" line.


Thank you, that works.  Note: it's not even compile tested at this point.

I should be able to do so tomorrow.

Are there other hashes to look at?  I noticed a few more:

UDBHASHSIZE
netinet/tcp_hostcache.c:#define TCP_HOSTCACHE_HASHSIZE  512
netinet/sctp_constants.h:#define SCTP_TCBHASHSIZE 1024
netinet/sctp_constants.h:#define SCTP_PCBHASHSIZE 256
netinet/tcp_syncache.c:#define TCP_SYNCACHE_HASHSIZE 512

Any of these look like good targets?  I think most could be looked at.  
I've only glanced.  I can provide deltas.


-Alfred


Re: auto tuning tcp

2012-11-13 Thread Alfred Perlstein

On 11/13/12 12:06 AM, Andre Oppermann wrote:

On 13.11.2012 07:45, Alfred Perlstein wrote:

On 11/12/12 10:23 PM, Peter Wemm wrote:

On Mon, Nov 12, 2012 at 10:11 PM, Alfred Perlstein wrote:

On 11/12/12 10:04 PM, Alfred Perlstein wrote:

On 11/12/12 10:48 AM, Alfred Perlstein wrote:

On 11/12/12 10:01 AM, Andre Oppermann wrote:


I've already added the tunable "kern.maxmbufmem" which is in pages.
That's probably not very convenient to work with.  I can change it
to a percentage of phymem/kva.  Would that make you happy?

It really makes sense to have the hash table be some relation to sockets
rather than buffers.

If you are hashing "foo-objects" you want the hash to be some relation to
the max amount of "foo-objects" you'll see, not backwards derived from the
number of "bar-objects" that "foo-objects" contain, right?

Because we are hashing the sockets, right?   not clusters.

Maybe I'm wrong?  I'm open to ideas.


Hey Andre, the following patch is what I was thinking
(uncompiled/untested), it basically rounds up the maxsockets to a power of 2
and replaces the default 512 tcb hashsize.

It might make sense to make the auto-tuning default to a minimum of 512.

There are a number of other hashes with static sizes that could make use
of this logic provided it's not upside-down.

Any thoughts on this?

Tune the tcp pcb hash based on maxsockets.
Be more forgiving of poorly chosen tunables by finding a closer power
of two rather than clamping down to 512.
Index: tcp_subr.c
===


Sorry, GUI mangled the patch... attaching a plain text version.



Wait, you want to replace a hash with a flat array?  Why even bother
to call it a hash at that point?




If you are concerned about the space/time tradeoff I'm pretty happy 
with making it 1/2, 1/4th, 1/8th the size of maxsockets.  (smaller?)

Would that work better?


I'd go for 1/8 or even 1/16 with a lower bound of 512.  More than
that is excessive.


I'm OK with 1/8.  All I'm really going for is trying to make it somewhat 
better than 512 when un-tuned.


The reason I chose to make it equal to max sockets was a space/time 
tradeoff, ideally a hash should have zero collisions and if a user has 
enough memory for 250,000 sockets, then surely they have enough memory 
for 256,000 pointers.


I agree in general.  Though not all large memory servers do serve a
large amount of connections.  We have to find a tradeoff here.

Having a perfect hash would certainly be laudable.  As long as the
average hash chain doesn't go beyond a few entries it's not a problem.

If you strongly disagree then I am fine with a more conservative 
setting, just note that effectively the hash table will require 1/2 the 
factor that we go smaller in additional traversals when we max out the 
number of sockets.  Meaning if the table is 1/4 the size of max sockets, 
when we hit that many tcp connections I think we'll see an order of 
average 2 linked list traversals to find a node.

At 1/8, then that number becomes 4.


I'm fine with that and claim that if you expect N sockets that you
would also increase maxfiles/sockets to N*2 to have some headroom.

That is a good point.


I recall back in 2001 on a PII400 with a custom webserver I wrote having 
a huge benefit by upping this to 2^14 or maybe even 2^16, I forget, but 
suddenly my CPU went down a huge amount and I didn't have to worry about 
a load balancer or other tricks.


I can certainly believe that.  A hash size of 512 is no good if
you have more than 4K connections.

PS: Please note that my patch for mbuf and maxfiles tuning is not yet
in HEAD, it's still sitting in my tcp_workqueue branch.  I still have
to search for derived values that may get totally out of whack with
the new scaling scheme.


This is cool!  Thank you for the feedback.

Would you like me to put this on a user branch somewhere for you to 
merge into your perf branch?


-Alfred


Re: auto tuning tcp

2012-11-12 Thread Alfred Perlstein

On 11/12/12 10:23 PM, Peter Wemm wrote:

On Mon, Nov 12, 2012 at 10:11 PM, Alfred Perlstein  wrote:

On 11/12/12 10:04 PM, Alfred Perlstein wrote:

On 11/12/12 10:48 AM, Alfred Perlstein wrote:

On 11/12/12 10:01 AM, Andre Oppermann wrote:


I've already added the tunable "kern.maxmbufmem" which is in pages.
That's probably not very convenient to work with.  I can change it
to a percentage of phymem/kva.  Would that make you happy?


It really makes sense to have the hash table be some relation to sockets
rather than buffers.

If you are hashing "foo-objects" you want the hash to be some relation to
the max amount of "foo-objects" you'll see, not backwards derived from the
number of "bar-objects" that "foo-objects" contain, right?

Because we are hashing the sockets, right?   not clusters.

Maybe I'm wrong?  I'm open to ideas.


Hey Andre, the following patch is what I was thinking
(uncompiled/untested), it basically rounds up the maxsockets to a power of 2
and replaces the default 512 tcb hashsize.

It might make sense to make the auto-tuning default to a minimum of 512.

There are a number of other hashes with static sizes that could make use
of this logic provided it's not upside-down.

Any thoughts on this?

Tune the tcp pcb hash based on maxsockets.
Be more forgiving of poorly chosen tunables by finding a closer power
of two rather than clamping down to 512.
Index: tcp_subr.c
===


Sorry, GUI mangled the patch... attaching a plain text version.



Wait, you want to replace a hash with a flat array?  Why even bother
to call it a hash at that point?




If you are concerned about the space/time tradeoff I'm pretty happy with 
making it 1/2, 1/4th, 1/8th the size of maxsockets.  (smaller?)


Would that work better?

The reason I chose to make it equal to max sockets was a space/time 
tradeoff, ideally a hash should have zero collisions and if a user has 
enough memory for 250,000 sockets, then surely they have enough memory 
for 256,000 pointers.


If you strongly disagree then I am fine with a more conservative 
setting, just note that effectively the hash table will require 1/2 the 
factor that we go smaller in additional traversals when we max out the 
number of sockets.  Meaning if the table is 1/4 the size of max sockets, 
when we hit that many tcp connections I think we'll see an order of 
average 2 linked list traversals to find a node.  At 1/8, then that 
number becomes 4.
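
The arithmetic behind those numbers, as a small sketch. It assumes the
hash spreads connections evenly over the buckets and that a lookup walks
about half of a chain on average; the function names are made up.

```c
/* Average number of entries per bucket when every socket is in use. */
static double
ex_load_factor(unsigned long nsockets, unsigned long nbuckets)
{
	return ((double)nsockets / (double)nbuckets);
}

/* Rough estimate used above: about half of a chain is walked per hit. */
static double
ex_avg_traversals(unsigned long nsockets, unsigned long nbuckets)
{
	return (ex_load_factor(nsockets, nbuckets) / 2.0);
}
```

With 262144 sockets, a table of 65536 buckets (1/4) gives chains of 4 and
about 2 traversals per lookup; 32768 buckets (1/8) gives chains of 8 and
about 4.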


I recall back in 2001 on a PII400 with a custom webserver I wrote having 
a huge benefit by upping this to 2^14 or maybe even 2^16, I forget, but 
suddenly my CPU went down a huge amount and I didn't have to worry about 
a load balancer or other tricks.



-Alfred







Re: auto tuning tcp

2012-11-12 Thread Alfred Perlstein

On 11/12/12 10:04 PM, Alfred Perlstein wrote:

On 11/12/12 10:48 AM, Alfred Perlstein wrote:

On 11/12/12 10:01 AM, Andre Oppermann wrote:


I've already added the tunable "kern.maxmbufmem" which is in pages.
That's probably not very convenient to work with.  I can change it
to a percentage of phymem/kva.  Would that make you happy?



It really makes sense to have the hash table be some relation to 
sockets rather than buffers.


If you are hashing "foo-objects" you want the hash to be some 
relation to the max amount of "foo-objects" you'll see, not backwards 
derived from the number of "bar-objects" that "foo-objects" contain, 
right?


Because we are hashing the sockets, right?   not clusters.

Maybe I'm wrong?  I'm open to ideas.


Hey Andre, the following patch is what I was thinking 
(uncompiled/untested), it basically rounds up the maxsockets to a 
power of 2 and replaces the default 512 tcb hashsize.


It might make sense to make the auto-tuning default to a minimum of 512.

There are a number of other hashes with static sizes that could make 
use of this logic provided it's not upside-down.


Any thoughts on this?

Tune the tcp pcb hash based on maxsockets.
Be more forgiving of poorly chosen tunables by finding a closer power
of two rather than clamping down to 512.
Index: tcp_subr.c
===


Sorry, GUI mangled the patch... attaching a plain text version.


Index: tcp_subr.c
===
--- tcp_subr.c  (revision 242936)
+++ tcp_subr.c  (working copy)
@@ -235,7 +235,7 @@
  * variable net.inet.tcp.tcbhashsize
  */
 #ifndef TCBHASHSIZE
-#define TCBHASHSIZE    512
+#define TCBHASHSIZE    0
 #endif
 
 /*
@@ -282,6 +282,27 @@
        return (0);
 }
 
+/*
+ * Take a value and get the next power of 2 that doesn't overflow.
+ * Used to size the tcp_inpcb hash buckets.
+ */
+static int
+maketcp_hashsize(int size)
+{
+       int hashsize;
+
+       /*
+        * auto tune.
+        * get the next power of 2 higher than maxsockets.
+        */
+       hashsize = 1 << fls(maxsockets);
+       /* catch overflow, and just go one power of 2 smaller */
+       if (hashsize < maxsockets) {
+               hashsize = 1 << (fls(maxsockets) - 1);
+       }
+       return hashsize;
+}
+
 void
 tcp_init(void)
 {
@@ -296,9 +317,20 @@
 
        hashsize = TCBHASHSIZE;
        TUNABLE_INT_FETCH("net.inet.tcp.tcbhashsize", &hashsize);
+       if (hashsize == 0) {
+               /* auto tune based on maxsockets */
+               hashsize = maketcp_hashsize(maxsockets);
+       }
+       /*
+        * Be forgiving of admins that don't know to make the tunable
+        * a power of two.
+        */
        if (!powerof2(hashsize)) {
-               printf("WARNING: TCB hash size not a power of 2\n");
-               hashsize = 512; /* safe default */
+               int oldhashsize = hashsize;
+
+               hashsize = maketcp_hashsize(hashsize);
+               printf("%s: WARNING: TCB hash size not a power of 2, "
+                   "fixed %d -> %d\n", __func__, oldhashsize, hashsize);
        }
        in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, hashsize, hashsize,
            "tcp_inpcb", tcp_inpcb_init, NULL, UMA_ZONE_NOFREE,
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
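
The rounding helper from the patch above can be tried out in userland.
This is a sketch under assumptions, not the patch itself: ex_fls() is a
portable stand-in for the kernel's fls(), the names are made up, and the
helper rounds its own size argument, whereas the posted (untested) version
appears to read maxsockets in both calls.

```c
#include <limits.h>

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for 0. */
static int
ex_fls(unsigned int v)
{
	int bit = 0;

	while (v != 0) {
		bit++;
		v >>= 1;
	}
	return (bit);
}

/*
 * Next power of two strictly above size, clamped to the largest power
 * of two that fits in an int so that the shift cannot overflow.
 */
static int
ex_next_pow2(int size)
{
	int bits;
	int maxbits = (int)(sizeof(int) * CHAR_BIT) - 2;

	if (size <= 0)
		return (1);
	bits = ex_fls((unsigned int)size);
	if (bits > maxbits)
		bits = maxbits;	/* go one power of 2 smaller */
	return (1 << bits);
}
```

Note that an input that is already a power of two is doubled (512 becomes
1024), which matches "next power of 2 higher than" in the patch comment;
that is harmless in tcp_init() because the second call only runs when the
value is not a power of two.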

Re: auto tuning tcp

2012-11-12 Thread Alfred Perlstein

On 11/12/12 10:48 AM, Alfred Perlstein wrote:

On 11/12/12 10:01 AM, Andre Oppermann wrote:


I've already added the tunable "kern.maxmbufmem" which is in pages.
That's probably not very convenient to work with.  I can change it
to a percentage of phymem/kva.  Would that make you happy?



It really makes sense to have the hash table be some relation to 
sockets rather than buffers.


If you are hashing "foo-objects" you want the hash to be some relation 
to the max amount of "foo-objects" you'll see, not backwards derived 
from the number of "bar-objects" that "foo-objects" contain, right?


Because we are hashing the sockets, right?   not clusters.

Maybe I'm wrong?  I'm open to ideas.


Hey Andre, the following patch is what I was thinking 
(uncompiled/untested), it basically rounds up the maxsockets to a power 
of 2 and replaces the default 512 tcb hashsize.


It might make sense to make the auto-tuning default to a minimum of 512.

There are a number of other hashes with static sizes that could make use 
of this logic provided it's not upside-down.


Any thoughts on this?

Tune the tcp pcb hash based on maxsockets.
Be more forgiving of poorly chosen tunables by finding a closer power
of two rather than clamping down to 512.
Index: tcp_subr.c
===
--- tcp_subr.c (revision 242936)
+++ tcp_subr.c (working copy)
@@ -235,7 +235,7 @@
  * variable net.inet.tcp.tcbhashsize
  */
 #ifndef TCBHASHSIZE
-#define TCBHASHSIZE 512
+#define TCBHASHSIZE 0
 #endif
 /*
@@ -282,6 +282,27 @@
 	return (0);
 }
+/*
+ * Take a value and get the next power of 2 that doesn't overflow.
+ * Used to size the tcp_inpcb hash buckets.
+ */
+static int
+maketcp_hashsize(int size)
+{
+	int hashsize;
+
+	/*
+	 * auto tune.
+	 * get the next power of 2 higher than size.
+	 */
+	hashsize = 1 << fls(size);
+	/* catch overflow, and just go one power of 2 smaller */
+	if (hashsize < size) {
+		hashsize = 1 << (fls(size) - 1);
+	}
+	return hashsize;
+}
+
 void
 tcp_init(void)
 {
@@ -296,9 +317,20 @@
 	hashsize = TCBHASHSIZE;
 	TUNABLE_INT_FETCH("net.inet.tcp.tcbhashsize", &hashsize);
+	if (hashsize == 0) {
+		/* auto tune based on maxsockets */
+		hashsize = maketcp_hashsize(maxsockets);
+	}
+	/*
+	 * Be forgiving of admins that don't know to make the tunable
+	 * a power of two.
+	 */
 	if (!powerof2(hashsize)) {
-		printf("WARNING: TCB hash size not a power of 2\n");
-		hashsize = 512; /* safe default */
+		int oldhashsize = hashsize;
+
+		hashsize = maketcp_hashsize(hashsize);
+		printf("%s: WARNING: TCB hash size not a power of 2, "
+		    "fixed %d -> %d\n", __func__, oldhashsize, hashsize);
 	}
 	in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, hashsize, hashsize,
 	    "tcp_inpcb", tcp_inpcb_init, NULL, UMA_ZONE_NOFREE,

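For anyone who wants to try the rounding outside the kernel, here is a minimal userland sketch of what the patch is after. It is not the patch itself: sketch_fls() is a hypothetical stand-in for fls(3) so that the example builds where libc does not provide it, sketch_hashsize() is an illustrative name, and returning 1 for non-positive input is an assumption made only for this sketch.

```c
#include <limits.h>

/*
 * Hypothetical portable stand-in for fls(3): 1-based index of the
 * highest set bit, or 0 when no bit is set.
 */
static int
sketch_fls(int mask)
{
	unsigned int v = (unsigned int)mask;
	int bit = 0;

	while (v != 0) {
		bit++;
		v >>= 1;
	}
	return (bit);
}

/*
 * Next power of two above size, as in "1 << fls(size)".  When that
 * shift would not fit in an int, stay one power of two lower.
 */
static int
sketch_hashsize(int size)
{
	int bits;

	if (size <= 0)
		return (1);	/* assumption: smallest usable table */
	bits = sketch_fls(size);
	if (bits >= (int)(sizeof(int) * CHAR_BIT) - 1)
		return (1 << (bits - 1));
	return (1 << bits);
}
```

With these definitions a request for 1000 buckets becomes 1024, and an exact power of two such as 512 is bumped to 1024, which matches the `1 << fls(size)` expression in the patch.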





Re: auto tuning tcp

2012-11-12 Thread Alfred Perlstein

On 11/12/12 10:01 AM, Andre Oppermann wrote:
> On 12.11.2012 18:43, Alfred Perlstein wrote:
>> On Nov 12, 2012, at 1:27 AM, Andre Oppermann wrote:
>>> On 12.11.2012 09:52, Alfred Perlstein wrote:
>>>> On 11/11/12 11:28 PM, Andre Oppermann wrote:
>>>>> On 12.11.2012 08:10, Alfred Perlstein wrote:
>>>>>> I noticed that TCBHASHSIZE does not autotune.
>>>>>>
>>>>>> What do you think of the following algorithm?
>>>>>>
>>>>>> Basically round down to next power of two based on nmbclusters / 64.
>>>>>
>>>>> Please wait out for a real fix of the various mbuf-whatever tuning
>>>>> issue I'll propose shortly.  This approach may become inappropriate.
>>>>> Also the mbuf limits can be changed at runtime by sysctl.
>>>>
>>>> What is the timeline you are asking for to wait?
>>>
>>> http://svnweb.freebsd.org/changeset/base/242910
>>
>> Very cool!
>>
>> So instead of nmbclusters, will maxsockets work? Ideas/suggestions?
>
> I've already added the tunable "kern.maxmbufmem" which is in pages.
> That's probably not very convenient to work with.  I can change it
> to a percentage of phymem/kva.  Would that make you happy?

It really makes sense to have the hash table be some relation to sockets
rather than buffers.

If you are hashing "foo-objects" you want the hash to be some relation
to the max amount of "foo-objects" you'll see, not backwards derived
from the number of "bar-objects" that "foo-objects" contain, right?

Because we are hashing the sockets, right?   not clusters.

Maybe I'm wrong?  I'm open to ideas.

-Alfred






Re: auto tuning tcp

2012-11-12 Thread Alfred Perlstein


On Nov 12, 2012, at 1:27 AM, Andre Oppermann  wrote:

> On 12.11.2012 09:52, Alfred Perlstein wrote:
>> On 11/11/12 11:28 PM, Andre Oppermann wrote:
>>> On 12.11.2012 08:10, Alfred Perlstein wrote:
>>>> I noticed that TCBHASHSIZE does not autotune.
>>>> 
>>>> What do you think of the following algorithm?
>>>> 
>>>> Basically round down to next power of two based on nmbclusters / 64.
>>> 
>>> Please wait out for a real fix of the various mbuf-whatever tuning
>>> issue I'll propose shortly.  This approach may become inappropriate.
>>> Also the mbuf limits can be changed at runtime by sysctl.
>>> 
>> What is the timeline you are asking for to wait?
> 
> http://svnweb.freebsd.org/changeset/base/242910

Very cool!

So instead of nmbclusters, will maxsockets work?  Ideas/suggestions?

-Alfred. 


Re: auto tuning tcp

2012-11-12 Thread Alfred Perlstein

On 11/11/12 11:28 PM, Andre Oppermann wrote:
> On 12.11.2012 08:10, Alfred Perlstein wrote:
>> I noticed that TCBHASHSIZE does not autotune.
>>
>> What do you think of the following algorithm?
>>
>> Basically round down to next power of two based on nmbclusters / 64.
>
> Please wait out for a real fix of the various mbuf-whatever tuning
> issue I'll propose shortly.  This approach may become inappropriate.
> Also the mbuf limits can be changed at runtime by sysctl.

What is the timeline you are asking for to wait?

-Alfred


auto tuning tcp

2012-11-11 Thread Alfred Perlstein

I noticed that TCBHASHSIZE does not autotune.

What do you think of the following algorithm?

Basically round down to next power of two based on nmbclusters / 64.

-Alfred

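#include <stdio.h>
#include <stdlib.h>
#include <strings.h>	/* fls() */

int
main(int argc, char **argv)
{
	int nmbclusters;
	int pow2cl;

	if (argc < 2) {
		fprintf(stderr, "usage: %s nmbclusters\n", argv[0]);
		return (1);
	}
	nmbclusters = atoi(argv[1]);
	/* Round nmbclusters / 64 down to a power of two, floor of 512. */
	if (nmbclusters / 64 < 512)
		pow2cl = 512;
	else
		pow2cl = 1 << (fls(nmbclusters / 64) - 1);
	printf("%d\n", pow2cl);
	return (0);
}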



Re: Patch for ip6_sprintf(), please review

2010-05-18 Thread Alfred Perlstein
Thank you Doug,

I will be committing this shortly.

* Doug Barton  [100516 12:21] wrote:
> Someone at work has been reading
> http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation :)
> 
> This change follows the rules in that draft which will become an RFC as
> soon as it finishes winding its way through the process, so I am
> supportive of the change you are proposing.
> 
> 
> Doug
> 
> On 5/15/2010 11:22 PM, Alfred Perlstein wrote:
> > Hello,
> > 
> > The following patch seems appropriate to apply
> > to fix the kernel ip6_sprintf() function.
> > 
> > What it is doing is ensuring that when we
> > abbreviate addresses that the longest string
> > of zeros is shortened, not the first run of
> > zeros.
> > 
> > Our internal commit log is:
> > problem:
> > Unification of IPv6 address representation
> > fix:
> > recommended format of text representing an IPv6 address
> > is summarized as follows.
> > 
> > 1. omit leading zeros
> > 
> > 2. "::" used to their maximum extent whenever possible
> > 
> > 3. "::" used where shortens address the most
> > 
> > 4. "::" used in the former part in case of a tie breaker
> > 
> > 5. do not shorten one 16 bit 0 field
> > 
> > 6. use lower case
> > 
> > Present code in ip6_sprintf() is following rules 1,2,5,6.
> > Adding fix for following other rules also. For following
> > rules 3 and 4, finding out the index where to replace zeros
> > with '::' and using that index.
> > References:
> > http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-04.html
> > 
> > 
> > Diff is attached in text format.
> > 
> > 
> > 
> > 
> 
> 
> 
> -- 
> 
>   ... and that's just a little bit of history repeating.
>   -- Propellerheads
> 
>   Improve the effectiveness of your Internet presence with
>   a domain name makeover!http://SupersetSolutions.com/
> 

-- 
- Alfred Perlstein
.- AMA, VMOA #5191, 03 vmax, 92 gs500, 85 ch250, 07 zx10
.- FreeBSD committer


Re: Patch for ip6_sprintf(), please review

2010-05-18 Thread Alfred Perlstein
* Hiroki Sato  [100517 22:43] wrote:
> Alfred Perlstein  wrote
>   in <20100516062211.gc6...@elvis.mu.org>:
> 
> al> The following patch seems appropriate to apply
> al> to fix the kernel ip6_sprintf() function.
> al>
> al> What it is doing is ensuring that when we
> al> abbreviate addresses that the longest string
> al> of zeros is shortened, not the first run of
> al> zeros.
> (snip)
> al> Diff is attached in text format.
> 
>  I think the code is correct and reasonable for commit.

Ok, I will do some final checks and commit shortly.

Thank you,
-Alfred


Patch for ip6_sprintf(), please review

2010-05-15 Thread Alfred Perlstein
Hello,

The following patch seems appropriate to apply
to fix the kernel ip6_sprintf() function.

What it does is ensure that when we
abbreviate addresses, the longest run
of zeros is shortened, not the first run of
zeros.

Our internal commit log is:
problem:
Unification of IPv6 address representation
fix:
recommended format of text representing an IPv6 address
is summarized as follows.

1. omit leading zeros

2. "::" used to their maximum extent whenever possible

3. "::" used where shortens address the most

4. "::" used in the former part in case of a tie breaker

5. do not shorten one 16 bit 0 field

6. use lower case

The present code in ip6_sprintf() follows rules 1, 2, 5 and 6.
This adds a fix so that it follows the other rules as well. For
rules 3 and 4, it finds the index at which the zeros should be
replaced with '::' and uses that index.
References:
http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-04.html


Diff is attached in text format.

-- 
- Alfred Perlstein
.- AMA, VMOA #5191, 03 vmax, 92 gs500, 85 ch250, 07 zx10
.- FreeBSD committer
Index: in6.c
===
--- in6.c	(revision 207329)
+++ in6.c	(working copy)
@@ -61,7 +61,7 @@
  */
 
 #include <sys/cdefs.h>
-__FBSDID("$FreeBSD$");
+__FBSDID("$FreeBSD: head/sys/netinet6/in6.c 207268 2010-04-27 09:47:14Z kib $");
 
 #include "opt_compat.h"
 #include "opt_inet.h"
@@ -1898,7 +1898,7 @@
 char *
 ip6_sprintf(char *ip6buf, const struct in6_addr *addr)
 {
-	int i;
+	int i, cnt = 0, maxcnt = 0, idx = 0, index = 0;
 	char *cp;
 	const u_int16_t *a = (const u_int16_t *)addr;
 	const u_int8_t *d;
@@ -1907,6 +1907,23 @@
 	cp = ip6buf;
 
 	for (i = 0; i < 8; i++) {
+		if (*(a + i) == 0) {
+			cnt++;
+			if (cnt == 1)
+				idx = i;
+		}
+		else if (maxcnt < cnt) {
+			maxcnt = cnt;
+			index = idx;
+			cnt = 0;
+		}
+	}
+	if (maxcnt < cnt) {
+		maxcnt = cnt;
+		index = idx;
+	}
+
+	for (i = 0; i < 8; i++) {
 		if (dcolon == 1) {
 			if (*a == 0) {
 				if (i == 7)
@@ -1917,7 +1934,7 @@
 				dcolon = 2;
 		}
 		if (*a == 0) {
-			if (dcolon == 0 && *(a + 1) == 0) {
+			if (dcolon == 0 && *(a + 1) == 0 && i == index) {
 				if (i == 0)
 					*cp++ = ':';
 				*cp++ = ':';
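
Rules 3 and 4 boil down to finding the longest run of zero groups, with the earliest run winning a tie. Below is a minimal standalone sketch of that scan, assuming the address is already split into eight host-order 16-bit groups. longest_zero_run() is a hypothetical helper for illustration, not the kernel code. It resets the run length at every non-zero group, which the first loop in the diff does only when a new maximum is found.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first group of the longest run of zero
 * groups, or -1 when no run is at least two groups long (rule 5).
 * The run length is stored in *lenp when lenp is not NULL.
 * Only a strictly longer run replaces the best one, so the earliest
 * run wins a tie (rule 4).
 */
static int
longest_zero_run(const uint16_t groups[8], int *lenp)
{
	int best = -1, bestlen = 0;
	int start = -1, len = 0;
	int i;

	for (i = 0; i < 8; i++) {
		if (groups[i] == 0) {
			if (len == 0)
				start = i;
			len++;
			if (len > bestlen) {
				bestlen = len;
				best = start;
			}
		} else {
			len = 0;	/* every non-zero group ends the run */
		}
	}
	if (bestlen < 2) {
		best = -1;
		bestlen = 0;
	}
	if (lenp != NULL)
		*lenp = bestlen;
	return (best);
}
```

For the groups 0:0:1:0:2:0:0:0 this picks index 5 with a length of 3, so the "::" would replace the trailing three groups rather than the leading two.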

Re: Can't start mysql in jail

2009-05-25 Thread Alfred Perlstein
* Miroslav Lachman <000.f...@quip.cz> [090525 10:27] wrote:
> Sam Wun wrote:
> >Hi,
> >
> >This seems a common question, but it is a bit different.
> >Production OS: FreeBSD 6.2
> >Source OS: FreeBSD 7.2
> >
> >I created a jailed mysql 5.1 in my source OS FreeBSD 7.2, and then tar
> 
> As you can see, there is different libc.so version, different threading 
> library, etc.
> 
> So you can't run a MySQL daemon built on a different major OS version.

You should be able to provided that you install the compat
libraries.  You may also need to use the libmap.conf
facility to fixup threading library to point to libthr but I
am unsure.

-Alfred


Re: ipv6 bugfix, need review.

2008-12-23 Thread Alfred Perlstein
* Doug Barton  [081223 11:46] wrote:
> On Mon, 22 Dec 2008, Alfred Perlstein wrote:
> 
> >Hey guys, we found a bug at Juniper and it resolves an issue
> >for us.  I've been asked to forward this to FreeBSD, I honestly
> >am not that clear on the issue so I'm hoping someone can step
> >up to review this.
> >
> >Synopsis is:
> >
> > The traffic class byte is set to 0x in the header of some
> > BGP packets sent between interfaces that have IPv6 addresses,
> > instead of the correct setting 0xc0 (INTERNETCONTROL).
> >
> >Fix is small and attached.  One thing I am wondering, do we
> >need to check "if (inp)" ?  I don't think so.
> 
> How about adding an assert to the patch to prove this theory? :)
> 
> I'll test it on my home box (which has IPv6) as soon as I'm done with the 
> stuff I'm working on atm.
> 
> 
> hth,
> 
> Doug

Thanks Doug, will do.

Please let me know the results.  Do you know how to test whether this is
actually being exercised?  I guess you could add a sysctl that
gets incremented each time this codepath is hit to test?


-- 
- Alfred Perlstein


ipv6 bugfix, need review.

2008-12-22 Thread Alfred Perlstein
Hey guys, we found a bug at Juniper and it resolves an issue
for us.  I've been asked to forward this to FreeBSD, I honestly
am not that clear on the issue so I'm hoping someone can step
up to review this.

Synopsis is:

  The traffic class byte is set to 0x in the header of some
  BGP packets sent between interfaces that have IPv6 addresses,
  instead of the correct setting 0xc0 (INTERNETCONTROL).

Fix is small and attached.  One thing I am wondering, do we
need to check "if (inp)" ?  I don't think so.

Index: bsd/sys/netinet/tcp_syncache.c
===
RCS file: /cvs/junos-2008/bsd/sys/netinet/tcp_syncache.c,v
retrieving revision 1.24
diff -p -u -r1.24 tcp_syncache.c
--- bsd/sys/netinet/tcp_syncache.c  29 Jul 2008 17:07:43 -  1.24
+++ bsd/sys/netinet/tcp_syncache.c  16 Dec 2008 19:23:31 -
@@ -1271,6 +1271,7 @@ syncache_respond(sc, m)
struct inpcb *inp;
 #ifdef INET6
struct ip6_hdr *ip6 = NULL;
+   int inp_tclass;
 #endif
struct rt_nexthop *minmtu_nh;
struct route_table *rtb = NULL;
@@ -1387,6 +1388,12 @@ syncache_respond(sc, m)
/* ip6_hlim is set after checksum */
ip6->ip6_flow &= ~IPV6_FLOWLABEL_MASK;
ip6->ip6_flow |= sc->sc_flowlabel;
+   /* Set the TC for IPv6 just like TOS for IPv4 */
+   ip6->ip6_flow &= ~IPV6_CLASS_MASK;
+   if (inp) {
+   inp_tclass = IPV6_GET_CLASS(inp->in6p_flowinfo);
+   ip6->ip6_flow |= IPV6_SET_CLASS(inp_tclass);
+   }
 
th = (struct tcphdr *)(ip6 + 1);
} else
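
To make the bit manipulation concrete, here is a small sketch of it. It assumes the first 32 bits of the IPv6 header are held in host byte order (4-bit version, 8-bit traffic class, 20-bit flow label). The SK_ macros and sk_ helpers are made up for illustration. The kernel keeps ip6_flow in network byte order and uses its own macros, so treat this as a model of the idea rather than the code in the diff.

```c
#include <stdint.h>

/* Hypothetical host-order layout: version(4) | traffic class(8) | flow label(20). */
#define SK_CLASS_MASK	0x0ff00000u
#define SK_CLASS_SHIFT	20

/* Extract the 8-bit traffic class from the first header word. */
static uint8_t
sk_get_tclass(uint32_t flow)
{
	return ((uint8_t)((flow & SK_CLASS_MASK) >> SK_CLASS_SHIFT));
}

/* Clear the old class bits, then OR in the new ones; other fields are untouched. */
static uint32_t
sk_set_tclass(uint32_t flow, uint8_t tclass)
{
	flow &= ~SK_CLASS_MASK;
	flow |= ((uint32_t)tclass << SK_CLASS_SHIFT) & SK_CLASS_MASK;
	return (flow);
}
```

Under that layout, sk_set_tclass(0x60000000, 0xc0) gives 0x6c000000: version 6, traffic class 0xc0, flow label 0.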


-- 
- Alfred Perlstein


Re: ACE and FreeBSD

2008-12-22 Thread Alfred Perlstein
* Randall Stewart  [081222 03:48] wrote:
> Hi all:
> 
> I am trying to get the latest ACE/TAO toolkit compiling with Head...
> (the port is marked broken in 7)..
> 
> In the process of fixing things I found something I am not sure how
> to approach.. for now I have just ifdef'd it out but maybe someone
> can point me to the right method...
> 
> They are using a ioctl -- SIOCGIFDATA -- to get access to the interface
> packet counts and such. Now near as I can tell we don't have that
> SIO. A google of someone a few years ago where the question was
> asked turned up a, we don't need that instead we should have
> access to this information via the sysctl.
> 
> So my immediate thought, hey netstat does this.. and it probably uses
> the sysctl... so I go and look at the code.. and tada.. it does a
> kread() to get the actual if_data  yuck.
> 
> So, is there a sysctl that gets access to this information? I have
> poked around in a sysctl -a -N and don't see anything that looks
> promising..
> 
> Pointers to the right approach would be appreciated.. I am not sure
> what the monitor stuff is used for.. but I would like to get this
> toolkit fully functional if possible :-)

You could expand SIOCGIFDATA, but you'd need to make a compat
SIOCGIFODATA (OLD DATA) ioctl.  Or you could export it maybe
through the dev sysctl tree.  I like the former.

-Alfred


Re: working directory within kernel code

2008-12-16 Thread Alfred Perlstein
* Ferner Cilloniz  [081216 12:33] wrote:
> I am trying to determine the current working directory when a system
> call is issued. im interested in determining this from a kernel module.
> 
> however, because system calls are only given a thread* and a void*,
> which gets casted, is there any way i find out the cwd?

thread should point to proc which should have a "current dir" vnode
in it, or a pointer to a struct that has it... keep poking around.

-Alfred


Re: Timers in drivers vs userland

2008-10-20 Thread Alfred Perlstein
Have you tried using rtprio?

You'll have to be really careful though so as not to jam up the
system using it.

-Alfred

* Len Gross <[EMAIL PROTECTED]> [081018 17:28] wrote:
> Slight correction; I should have said more accurate usleep, not "timer."
> 
> -- Len
> 
> On Sat, Oct 18, 2008 at 3:12 PM, Len Gross <[EMAIL PROTECTED]> wrote:
> > If I place a timer directly in a driver (like Ethernet)  will it be
> > subject to less jitter and more consistency than if it were in
> > Userland?
> >
> > I know FreeBSD is not "real time," but I need to be able to run a
> > polling algorithm with about 1 ms accuracy.
> >
> > Thanks in advance.
> >
> > (Please tell me if there is a better list for this question.)
> >
> > -- Len
> >

-- 
- Alfred Perlstein


Re: Closing connection from an accept_filter(9)

2008-10-20 Thread Alfred Perlstein
* David DeSimone <[EMAIL PROTECTED]> [081018 02:25] wrote:
> Eugene M. Kim <[EMAIL PROTECTED]> wrote:
> >
> > Is it possible to close a connection from an accept filter, for
> > example, in order to prevent an incoming connection with a malformed
> > request body from ever reaching the userland?
> 
> How would you propose to find out what is in the request body without
> first accepting the connection?

By writing a custom accept filter! :)

-Alfred


Re: Closing connection from an accept_filter(9)

2008-10-17 Thread Alfred Perlstein
* Eugene M. Kim <[EMAIL PROTECTED]> [081017 17:58] wrote:
> Hello,
> 
> Is it possible to close a connection from an accept filter, for example, 
> in order to prevent an incoming connection with a malformed request body 
> from ever reaching the userland?

Probably, look at what happens inside of syncache or syncookies
to sockets that are on accept queue but not yet "accepted".

-Alfred



> 
> Cheers,
> Eugene
> 

-- 
- Alfred Perlstein


Re: Question regarding NFS

2008-09-24 Thread Alfred Perlstein
* Adam Stylinski <[EMAIL PROTECTED]> [080918 17:15] wrote:
> Hello,
>   I am running an IPCop firewall for my entire network.  I have a
> wireless network device on the blue subnet which must access a freebsd NFS
> server.  In order to do this, I need to open a DMZ pinhole on a few select
> ports.  It's my understanding that NFS chooses random ports and I was
> wondering if there was a way I could fix this.  There is a good reason that
> the subnet for the wireless is separate from the wired and I'd rather not
> configure this thing over a VPN.  The client connecting to the NFS server is
> a voyage computer (pretty much a small debian).  Also, if at all possible,
> I'd like to keep performance reasonably high when large volumes of clients
> are connecting to the NFS server, I'm not sure if binding to one port may or
> may not make this impossible.  I apologize for my stupidity and lack of
> understanding when it comes to NFS.  Any help would be gladly appreciated,
> guys.

_usually_ NFS uses port 2049 on the server side.  I think the client may
bind to a random low port; this would be annoying to change, but it could
be done with a kernel hack relatively easily.  Look at the code in
src/sys/nfsclient/nfs_socket.c, there's some code that deals with
binding sockets that you can play with.

-- 
- Alfred Perlstein


Re: too many open file descriptors messages since bind 9.4.2-P1 (port dns94)

2008-07-15 Thread Alfred Perlstein
FWIW, the userland scan of the files is not nearly as bad as
what happens in the kernel when hundreds or thousands of objects
are accessed that blow out the cache, oh and the locking that
occurs as well.

* Peter Jeremy <[EMAIL PROTECTED]> [080715 16:43] wrote:
> On 2008-Jul-15 16:09:17 -0700, Bakul Shah <[EMAIL PROTECTED]> wrote:
> >IIRC, when poll() returns n, you only look at the first n
> >values in the pollfd array so it is a win when you expect a
> >very small number of fds to be ready.  In the select case you
> >have to test the bit array until you see the last ready fd.
> 
> No.  Both poll(2) and select(2) return the number of FDs ready for
> I/O.  You need to scan the pollfd or fd_set array until you find that
> many FDs ready.
> 
> poll(2) is a win if you only need to test a small number of FDs
> compared to the number of FDs that the process has open.  In the case
> of bind, you have a large number of FDs to test, of which you are
> only expecting a very small number to be ready - if you don't
> treat fd_set as opaque, select(2) allows you to quickly skip large
> (roughly wordsize) chunks of un-interesting FDs.
> 
> Note that, based on sys_generic.c in 7.x and -CURRENT, poll(2) is
> limited to checking FD_SETSIZE descriptors, whilst select(2) has
> no upper limit.
> 
> -- 
> Peter Jeremy
> Please excuse any delays as the result of my ISP's inability to implement
> an MTA that is either RFC2821-compliant or matches their claimed behaviour.

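A minimal sketch of the userland scan Peter describes for poll(2): walk the pollfd array and stop once as many ready entries as poll() reported have been seen. collect_ready() is a hypothetical helper, and any non-zero revents is counted as ready here.

```c
#include <poll.h>

/*
 * After poll() has returned nready, copy the ready descriptors into
 * out[] (which must have room for nready entries) and return how many
 * were found.  The loop stops early once nready entries have been
 * seen, so the tail of the array is not examined.
 */
static int
collect_ready(const struct pollfd *pfds, nfds_t nfds, int nready, int *out)
{
	int found = 0;
	nfds_t i;

	for (i = 0; i < nfds && found < nready; i++) {
		if (pfds[i].revents != 0)
			out[found++] = pfds[i].fd;
	}
	return (found);
}
```

The early exit only helps when the ready descriptors sit near the front of the array. In the worst case the whole array is still walked, which is the cost being discussed above.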


-- 
- Alfred Perlstein


Re: FreeBSD network stack Vs others

2008-02-04 Thread Alfred Perlstein
* ithilgore -- <[EMAIL PROTECTED]> [080204 06:59] wrote:
> I'd like to learn what the basic differences (pros and cons) are between the
> FreeBSD network stack and the other OSs' (especially Linux).
> 
> I know that linux has had everything rewritten from scratch as far as the
> implementation of tcp-ip and the sockets are concerned and would like to
> know if this has made it actually more robust or state-of-the-art than
> FreeBSD's or the opposite.
> 
> Some actual technical details and references would be appreciated.

Linux's stack hasn't been rewritten from the BSD one, it was written
from scratch.

Linux's tcp/ip stack has been rewritten many times over the years
with the promise of large performance gains.

The fact of the matter is that the performance on the "bleeding
edge" of both systems, FreeBSD and Linux, is about the same.

From a BSD proponent's perspective, I would take the pragmatic
viewpoint that every time Linux reinvents its stack to get performance
or some other feature, FreeBSD isn't far behind with a relatively
minor change to its stack to accomplish the same feat.

-Alfred


Re: Packet loss every 30.999 seconds

2007-12-22 Thread Alfred Perlstein
* David G Lawrence <[EMAIL PROTECTED]> [071221 23:31] wrote:
> > > > Can you use a placeholder vnode as a place to restart the scan?
> > > > you might have to mark it special so that other threads/things
> > > > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > > > restart point.
> > > 
> > >That was one of the solutions that I considered and rejected since it
> > > would significantly increase the overhead of the loop.
> > >The solution provided by Kostik Belousov that uses uio_yield looks like
> > > a fine solution. I intend to try it out on some servers RSN.
> > 
> > Out of curiosity's sake, why would it make the loop slower?  one
> > would only add the placeholder when yielding, not for every iteration.
> 
>Actually, I misread your suggestion and was thinking marker flag,
> rather than placeholder vnode. Sorry about that. The current code
> actually already uses a marker vnode. It is hidden and obfuscated in
> the MNT_VNODE_FOREACH macro, further hidden in the __mnt_vnode_first/next
> functions, so it should be safe from vnode reclamation/free problems.

That level of obscuring is a bit worrisome.

Yes, I did mean placeholder vnode.

Even so, is it of utility or not?

Or is it already being used and I'm missing something and should
just "utsl" at this point?

-- 
- Alfred Perlstein


Re: Packet loss every 30.999 seconds

2007-12-21 Thread Alfred Perlstein
* David G Lawrence <[EMAIL PROTECTED]> [071221 15:42] wrote:
> > >Unfortunately, the version of the patch that I sent out isn't going to
> > > help your problem. It needs to yield at the top of the loop, but vp isn't
> > > necessarily valid after the wakeup from the msleep. That's a problem that
> > > I'm having trouble figuring out a solution to - the solutions that come
> > > to mind will all significantly increase the overhead of the loop.
> > 
> > I apologize for not reading the code as I am swamped, but a technique
> > that Matt Dillon used for bufs might work here.
> > 
> > Can you use a placeholder vnode as a place to restart the scan?
> > you might have to mark it special so that other threads/things
> > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > restart point.
> 
>That was one of the solutions that I considered and rejected since it
> would significantly increase the overhead of the loop.
>The solution provided by Kostik Belousov that uses uio_yield looks like
> a fine solution. I intend to try it out on some servers RSN.

Out of curiosity's sake, why would it make the loop slower?  one
would only add the placeholder when yielding, not for every iteration.



-- 
- Alfred Perlstein


Re: Packet loss every 30.999 seconds

2007-12-21 Thread Alfred Perlstein
* David G Lawrence <[EMAIL PROTECTED]> [071219 09:12] wrote:
> > >Try it with "find / -type f >/dev/null" to duplicate the problem
> > >almost instantly.
> > 
> > I was able to verify last night that (cd /; tar -cpf -) > all.tar would
> > trigger the problem.  I'm working getting a test running with
> > David's ffs_sync() workaround now, adding a few counters there should
> > get this narrowed down a little more.
> 
>Unfortunately, the version of the patch that I sent out isn't going to
> help your problem. It needs to yield at the top of the loop, but vp isn't
> necessarily valid after the wakeup from the msleep. That's a problem that
> I'm having trouble figuring out a solution to - the solutions that come
> to mind will all significantly increase the overhead of the loop.
>As a very inadequate work-around, you might consider lowering
> kern.maxvnodes to something like 2 - that might be low enough to
> not trigger the problem, but also be high enough to not significantly
> affect system I/O performance.

I apologize for not reading the code as I am swamped, but a technique
that Matt Dillon used for bufs might work here.

Can you use a placeholder vnode as a place to restart the scan?
you might have to mark it special so that other threads/things
(getnewvnode()?) don't molest it, but it can provide for a convenient
restart point.
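
To show what I mean by a placeholder, here is a minimal userland sketch using the TAILQ macros from <sys/queue.h>. Everything in it is made up for illustration (struct node, scan_with_marker(), drop_first()), and it is not the ffs_sync() code. The marker is only linked into the list around a yield, so iterations that do not yield pay nothing extra, and the scan restarts from the marker even if the node it had just processed was freed in the meantime.

```c
#include <stddef.h>
#include <sys/queue.h>

struct node {
	TAILQ_ENTRY(node) link;
	int is_marker;		/* set only on placeholder nodes */
	int value;
};
TAILQ_HEAD(nodelist, node);

/*
 * Sum the values on the list, calling yield() after every `batch`
 * real nodes.  The list may be modified while we yield.
 */
static int
scan_with_marker(struct nodelist *head, int batch,
    void (*yield)(struct nodelist *))
{
	struct node marker = { .is_marker = 1 };
	struct node *np;
	int seen = 0, sum = 0;

	np = TAILQ_FIRST(head);
	while (np != NULL) {
		if (np->is_marker) {	/* skip other scanners' placeholders */
			np = TAILQ_NEXT(np, link);
			continue;
		}
		sum += np->value;
		if (++seen % batch == 0) {
			/* Park the marker right behind the node just handled. */
			TAILQ_INSERT_AFTER(head, np, &marker, link);
			yield(head);	/* np may be gone after this */
			np = TAILQ_NEXT(&marker, link);
			TAILQ_REMOVE(head, &marker, link);
		} else {
			np = TAILQ_NEXT(np, link);
		}
	}
	return (sum);
}

/* Example yield: another thread "reclaims" the first real node. */
static void
drop_first(struct nodelist *head)
{
	struct node *np = TAILQ_FIRST(head);

	if (np != NULL && !np->is_marker)
		TAILQ_REMOVE(head, np, link);
}
```

With a batch of 1 and drop_first() as the yield, every node is removed right after it is processed, yet the scan still visits each node exactly once because it never touches np again after yielding.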

-- 
- Alfred Perlstein


Re: bikeshed for all!

2007-12-12 Thread Alfred Perlstein
* Julian Elischer <[EMAIL PROTECTED]> [071212 15:13] wrote:
> Alfred Perlstein wrote:
> >try using "instance".
> >
> >"Oh I'm going to use the FOO routing instance."
> 
> what do Juniper call it?

"Instance" and "vrf".

-Alfred
> 
> >
> >Works nicely.
> >
> >* Julian Elischer <[EMAIL PROTECTED]> [071212 14:34] wrote:
> >>So, I'm playing with some multiple routing table support..
> >>the first version is a minimal impact version with very limited 
> >>functionality.
> >>It's done that way so I can put it in RELENG_6/7 without breaking ABIs (I 
> >>hope).
> >>Later there will be a more flexible version for-current.
> >>
> >>Here's the question..
> >>
> >>I need a word to use to describe the network view one is currently on..
> >>e.g. if you are using the second routing table, you could say I've set xxx 
> >>to 1
> >>(0 based)..
> >>
> >>
> >>currently in my code I'm using 'universe' but I don't like that..
> >>
> >>one could think of it as a routing plane..
> >>each routing plane has the same interfaces on it but they are logically 
> >>treated differently because each plane has a different routing table.
> >>
> >>
> >>so here's an example of it in use now...
> >>the names should change...
> >>
> >>setuniverse 1 netstat -rn
> >>[shows table 1]
> >>setuniverse 2 route add 10.0.0.0/24 192.168.2.1
> >>setuniverse 1 route add 10.0.0.0/24 192.168.3.1
> >>setuniverse 2 route -n get 10.0.0.3
> >>[shows 192.168.2.1]
> >>setuniverse 1 route -n get 10.0.0.3
> >>[shows 192.168.3.1]
> >>setuniverse 2 start_apache
> >>[apache starts, always using 192.168.2.1 to reach the 10.0.0 net.]
> >>
> >>
> >>also the syscall is setuniverse()
> >>
> >>so, you see I really need a better name
> >>setrtab?
> >>
> >>rtab? rtbl?
> >>
> >>and the command should be called ""
> >>
> >>
> >
> 

-- 
- Alfred Perlstein


Re: bikeshed for all!

2007-12-12 Thread Alfred Perlstein
* Mike Silbersack <[EMAIL PROTECTED]> [071212 15:09] wrote:
> 
> On Wed, 12 Dec 2007, Julian Elischer wrote:
> 
> >So, I'm playing with some multiple routing table support..
> >the first version is a minimal impact version with very limited 
> >functionality.
> >It's done that way so I can put it in RELENG_6/7 without breaking ABIs (I 
> >hope).
> >Later there will be a more flexible version for-current.
> >
> >Here's the question..
> >
> >I need a word to use to describe the network view one is currently on..
> >e.g. if you are using the second routing table, you could say I've set xxx 
> >to 1
> >(0 based)..
> 
> In the spirit of your subject, why not call them 'sheds'?

Because it's horrible. :)

-- 
- Alfred Perlstein


Re: bikeshed for all!

2007-12-12 Thread Alfred Perlstein
* Peter Wood <[EMAIL PROTECTED]> [071212 14:53] wrote:
> > so, you see I really need a better name
> > setrtab?
> >
> > rtab? rtbl?
> >
> > and the command should be called "????"
> 
> Would "vrf" (Virtual Routing and Forwarding) be too technical? From 
> experience, Cisco calls it vrf; Juniper uses routing-instance IIRC.

Yes, Juniper calls it "instance", although, I'm quite sure I've
heard "vrf" said over the cubes here.  

-- 
- Alfred Perlstein


Re: bikeshed for all!

2007-12-12 Thread Alfred Perlstein
Try using "instance".

"Oh I'm going to use the FOO routing instance."

Works nicely.

* Julian Elischer <[EMAIL PROTECTED]> [071212 14:34] wrote:
> So, I'm playing with some multiple routing table support..
> the first version is a minimal impact version with very limited 
> functionality.
> It's done that way so I can put it in RELENG_6/7 without breaking ABIs (I 
> hope).
> Later there will be a more flexible version for -current.
> 
> Here's the question..
> 
> I need a word to use to describe the network view one is currently on..
> e.g. if you are using the second routing table, you could say I've set xxx to 
> 1
> (0 based)..
> 
> 
> currently in my code I'm using 'universe' but I don't like that..
> 
> one could think of it as a routing plane..
> each routing plane has the same interfaces on it but they are logically 
> treated differently because each plane has a different routing table.
> 
> 
> so here's an example of it in use now...
> the names should change...
> 
> setuniverse 1 netstat -rn
> [shows table 1]
> setuniverse 2 route add 10.0.0.0/24 192.168.2.1
> setuniverse 1 route add 10.0.0.0/24 192.168.3.1
> setuniverse 2 route -n get 10.0.0.3
> [shows 192.168.2.1]
> setuniverse 1 route -n get 10.0.0.3
> [shows 192.168.3.1]
> setuniverse 2 start_apache
> [apache starts, always using 192.168.2.1 to reach the 10.0.0 net.]
> 
> 
> also the syscall is setuniverse()
> 
> so, you see I really need a better name
> setrtab?
> 
> rtab? rtbl?
> 
> and the command should be called "????"
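
For illustration, a rough C sketch of how a wrapper command like the
one quoted above might be structured: parse the table index, switch
the calling process to that routing table, then exec the target
command so that it inherits the setting.  The names parse_table_index,
set_routing_table, get_routing_table and run_in_table, the MAX_TABLES
limit, and the assumption that the setting survives exec are all
hypothetical.  The real syscall (setuniverse() in this thread) is
replaced by a stub.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_TABLES 16   /* hypothetical limit, not taken from the patch */

/* Parse a decimal routing table index; 0 on success, -1 on bad input. */
int parse_table_index(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v < 0 || v >= MAX_TABLES)
        return -1;
    *out = (int)v;
    return 0;
}

/* Stand-in for the real syscall; it only remembers the requested value. */
static int current_table = 0;

int set_routing_table(int idx)
{
    assert(idx >= 0 && idx < MAX_TABLES);
    current_table = idx;
    return 0;
}

int get_routing_table(void)
{
    return current_table;
}

/* Usage: <wrapper> <table> <command> [args...]; argv[0] is the wrapper. */
int run_in_table(int argc, char **argv)
{
    int idx;

    if (argc < 3 || parse_table_index(argv[1], &idx) != 0) {
        fprintf(stderr, "usage: %s table command [args...]\n", argv[0]);
        return 64;
    }
    if (set_routing_table(idx) != 0)
        return 1;
    execvp(argv[2], &argv[2]);   /* returns only if the exec failed */
    perror(argv[2]);
    return 127;
}
```

A main() that simply returns run_in_table(argc, argv) would give the
"setuniverse 2 start_apache" style of invocation shown in the quote.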
> 
> 

-- 
- Alfred Perlstein


Re: Switch pfil(9) to rmlocks

2007-11-26 Thread Alfred Perlstein
* Robert Watson <[EMAIL PROTECTED]> [071126 12:37] wrote:
> 
> On Fri, 23 Nov 2007, Max Laier wrote:
> 
> >attached is a diff to switch the pfil(9) subsystem to rmlocks, which are 
> >more suited for the task.  I'd like some exposure before doing the switch, 
> >but I don't expect any fallout.  This email is going through the patched 
> >pfil already - twice.
> 
> FYI, since people are experimenting with rmlocks as a substitute for 
> rwlocks, I played with moving the global rwlock used to protect the name 
> space and linkage of UNIX domain sockets to be an rmlock.  Kris didn't see 
> any measurable change in performance for his MySQL benchmarks, but I 
> figured I'd post the patches as they give a sense of what change impact 
> things like reader state management have on code.  Attached below.  I have 
> no current plans to commit these changes as they appear not to offer 
> benefit (either because the rwlock overhead was negligible compared to other 
> costs in the benchmark, or because the read/write blend was too skewed 
> towards writes -- I think probably the former rather than the latter).

I would track the read/write lock mix to get an idea of what the
ratio is.
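
One way to track the mix, as a minimal sketch: bump a counter at each
read and write acquisition and report the read share.  The struct and
function names here are made up for illustration, and plain counters
are an assumption for brevity; kernel code would more likely want
per-CPU or atomic counters.

```c
#include <assert.h>

/* Hypothetical counters for one lock. */
struct lock_mix {
    unsigned long reads;
    unsigned long writes;
};

void lock_mix_note_read(struct lock_mix *m)  { m->reads++; }
void lock_mix_note_write(struct lock_mix *m) { m->writes++; }

/* Percentage of acquisitions that were read locks (0 if none recorded). */
unsigned lock_mix_read_percent(const struct lock_mix *m)
{
    unsigned long total = m->reads + m->writes;

    assert(total >= m->reads);   /* catches counter wrap-around */
    if (total == 0)
        return 0;
    return (unsigned)(m->reads * 100 / total);
}
```

A result close to 100 would suggest the workload is read-mostly, which
is the case rmlocks are aimed at.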

-Alfred


Re: accept filters and zero copy sockets

2007-10-19 Thread Alfred Perlstein
* Jonathan Noack <[EMAIL PROTECTED]> [071018 20:59] wrote:
> I'm in the process of upgrading my web/database/nfs/jack-of-all-trades box
> from 6.2 to RELENG_7.  I figured now would be a good time to clean up my
> kernel config files.  I have the following in my old kernel config:
> 
> # Statically Link in accept filters
> options   ACCEPT_FILTER_DATA
> options   ACCEPT_FILTER_HTTP
> 
> # Zero copy sockets support.  This enables "zero copy" for sending and
> # receiving data via a socket.  The send side works for any type of NIC,
> # the receive side only works for NICs that support MTUs greater than the
> # page size of your architecture and that support header splitting.  See
> # zero_copy(9) for more details.
> options   ZERO_COPY_SOCKETS
> 
> Are these options still working/recommended?  With all the changes to
> networking over the years (this box was originally set up during the 4.x
> days and has been upgraded many times) I have no idea if these are still
> good things to have.

Accept filters should certainly work, otherwise someone will get
some noogies...
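
For reference, a small sketch of how a server asks for the HTTP accept
filter on a listening socket, using the SO_ACCEPTFILTER socket option
and the "httpready" filter name.  The function name is made up, and
the #ifdef fallback for systems without accept filters is my own
assumption so that the snippet builds elsewhere.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Ask the kernel to hold connections on a listening socket until a
 * full HTTP request has arrived.  Returns 0 on success, -1 on failure
 * or when the platform has no accept filters.
 */
int enable_http_accept_filter(int listen_fd)
{
#ifdef SO_ACCEPTFILTER
    struct accept_filter_arg afa;

    memset(&afa, 0, sizeof(afa));
    assert(sizeof("httpready") <= sizeof(afa.af_name));
    strcpy(afa.af_name, "httpready");
    return setsockopt(listen_fd, SOL_SOCKET, SO_ACCEPTFILTER,
                      &afa, sizeof(afa));
#else
    (void)listen_fd;   /* no accept filters on this platform */
    return -1;
#endif
}
```

The call is normally made after listen(), and the filter has to be
available in the kernel, which is what the ACCEPT_FILTER_HTTP option
above provides.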

-Alfred


Re: Too many TIME_WAIT connections

2007-10-01 Thread Alfred Perlstein
* Jamie Ostrowski <[EMAIL PROTECTED]> [071001 16:02] wrote:
>Hello -
> 
>I've got a mailserver running FreeBSD 4.11 and Sendmail 8.13 that has
> been running as a mailserver for a couple of years without any
> load/connection problems. Here are my memory stats:
> Mem: 71M Active, 265M Inact, 96M Wired, 24M Cache, 60M Buf, 36M Free
> Swap: 2048M Total, 760K Used, 2047M Free
> 
> Then all of a sudden we started experiencing dropped connections even though
> the load average is generally around 2.0 or less.
> 
>   I found the problem today: there are currently 1300 socket connections
> suspended at status TIME_WAIT on the incoming smtp port.
> 
>   I checked some of my kernel settings:
> 
>   kern.ipc.somaxconn = 128
>   net.inet.tcp.msl: 3
> 
>   I suspect this is a dos attack: they're just opening these connections,
> and then let them hang there and they don't close them, so they just build
> up and the machine rejects new connections.
> 
>   Based on my configuration, does anyone have some suggestions on how I
> might tweak the system to overcome this (apparent?) DOS attack?

You can tweak msl, but it probably makes more sense to use some form
of firewall, ipfw, ipfilter, pf, etc on the box.

you can use netstat to see the remote addresses, just block them.
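
As a sketch of the netstat approach, the helper below picks the remote
address out of one line of "netstat -an" style output when the
connection is in TIME_WAIT.  It assumes the BSD column layout (proto,
Recv-Q, Send-Q, local, foreign, state) with IPv4 addresses printed as
addr.port; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * If `line` is a TCP entry in TIME_WAIT, copy the remote IPv4 address
 * (without the trailing ".port") into `addr` and return 1; otherwise
 * return 0.
 */
int time_wait_remote(const char *line, char *addr, size_t addrlen)
{
    char proto[16], local[64], foreign[64], state[32];
    unsigned long recvq, sendq;
    char *dot;

    assert(addr != NULL && addrlen > 0);
    if (sscanf(line, "%15s %lu %lu %63s %63s %31s",
               proto, &recvq, &sendq, local, foreign, state) != 6)
        return 0;
    if (strncmp(proto, "tcp", 3) != 0 || strcmp(state, "TIME_WAIT") != 0)
        return 0;
    dot = strrchr(foreign, '.');   /* split address from port */
    if (dot == NULL)
        return 0;
    *dot = '\0';
    if (strlen(foreign) >= addrlen)
        return 0;
    strcpy(addr, foreign);
    return 1;
}
```

Feeding every line through this and counting repeats per address would
show which remote hosts account for most of the TIME_WAIT entries.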

-- 
- Alfred Perlstein


Re: Quagga as border router

2007-09-20 Thread Alfred Perlstein
* Yuri Lukin <[EMAIL PROTECTED]> [070920 16:49] wrote:
> On Thu, 20 Sep 2007 00:24:09 -0700, Alfred Perlstein wrote
> > 
> > Juniper is based on FreeBSD. ;-)
> > 
> 
> On old code from the 4.x days I think, right?

In the current release, yes.

Would you like a router based on 5.x? :)

-- 
- Alfred Perlstein


Re: Quagga as border router

2007-09-20 Thread Alfred Perlstein
* Steve Bertrand <[EMAIL PROTECTED]> [070919 21:14] wrote:
> >>> Essentially, I'd like a board with at *least* 6 PCI-X slots, and perhaps
> >>> 8 RAM slots (if I can find justification that my router will work better
> >>> with up to 16GB of memory).
> > 
> > Why would you go with PCI-X? it's slow and getting end-of life..
> > 
> > go for PCI-Express.
> > there are quad PCI-E gigabit cards available.
> > Much lower packet latency.
> 
> As per my last email to Sten and the list...
> 
> I'm not a hardware person. PCI-E, PCI-X, I don't know the difference.
> 
> It was assumed that others would understand what I wanted and be able to
> make recommendations to me, and correct me on my terminology.
> 
> All I do know is that there is something more than ISA slots, and 386's
> now ;)
> 
> My request wasn't for clarification on motherboard technicalities, it
> was essentially a request on a recommendation for a hardware/software
> platform based on FreeBSD, that could possibly replace a Cisco 7206-VXR
> based on the NPE-G2 processing engine (or equivalent).

Juniper is based on FreeBSD. ;-)

-- 
- Alfred Perlstein


Re: FreeBSD discarding received packets > MTU

2007-09-07 Thread Alfred Perlstein
* David Christensen <[EMAIL PROTECTED]> [070907 13:41] wrote:
> > > I'm not completely opposed to making such a change, but I don't want
> > > to make a default change in the driver's behavior that other people 
> > > may be depending upon (whether they are aware of it or not).  A
> > > tunable driver value could be the answer but I'm not entirely sure
> > > how it would fare in the hardware at the high end of MTU 
> > values such 
> > > as 9000.
> > 
> > Dave:
> > 
> > Internet etiquette demands being gracious in what you accept.
> > The default policy of FreeBSD is to accept such packets.
> > This is a really weird bug to track down.
> > Other drivers support it.
> > 
> > This isn't worth making a stand over, unless you're trying
> > to hold users of YOUR driver hostage.
> > 
> 
> I'm just being cautious about making changes before I understand
> all of the implications.  The driver's current behavior is
> supported by IEEE 802.3 specification (802.3-2005, 4.2.4.2.1)
> and is implemented in the same way for other operating systems
> that are very widely deployed (including Windows and Linux)
> without any reported problems.  The existing bge driver which
> was developed for FreeBSD 10 years ago also operates this way,
> so all of my references for porting this driver happen to agree
> on the same implementation.

Which is all well and good, but the age of a bug does not a feature
make.

Please think of the four points I raised.

I think it makes sense to possibly add an "enforce rx mtu" knob
somewhere, but it should likely be turned off.
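
Roughly what such a knob could look like, as a sketch only: with the
knob off, anything that fits in a standard mbuf cluster is accepted;
with it on, the limit is the MTU plus Ethernet header, VLAN tag and
CRC.  This is not the bce driver's code; rx_frame_ok and its
parameters are hypothetical, and the constants are defined locally
with their usual values.

```c
#include <assert.h>

#define ETHER_HDR_LEN        14    /* destination + source + type */
#define ETHER_CRC_LEN         4
#define ETHER_VLAN_ENCAP_LEN  4
#define MCLBYTES           2048    /* usual mbuf cluster size */

/*
 * Decide whether a received frame of `len` bytes (headers and CRC
 * included) should be passed up the stack.
 */
int rx_frame_ok(int len, int mtu, int enforce_rx_mtu)
{
    int strict_max = mtu + ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN +
        ETHER_CRC_LEN;
    int loose_max = strict_max > MCLBYTES ? strict_max : MCLBYTES;

    assert(mtu > 0);
    if (len <= 0)
        return 0;
    return len <= (enforce_rx_mtu ? strict_max : loose_max);
}
```

With an MTU of 1500 the strict limit comes out at 1522 bytes, so a
1550 byte frame is dropped only when the knob is turned on.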

-Alfred


Re: FreeBSD discarding received packets > MTU

2007-09-07 Thread Alfred Perlstein
* David Christensen <[EMAIL PROTECTED]> [070907 10:48] wrote:
> > > It could certainly be argued by some that Cisco is not standards
> > > compliant in this case for sending an oversized Ethernet frame
> > > and expecting everyone to accept it.  Hardware has limitations
> > > and assuming that all Ethernet controllers can support frames
> > > greater than 1522 bytes is not reasonable.  Fortunately there is
> > > a suitable workaround which is setting a larger MTU for the 
> > > interface.  What size do you use?  How did you arrive at that
> > > value?
> > 
> > I use 1550 to make it work in the test harness.
> > 
> > The trouble is that if I set the mtu to 1550, and the machine 
> > talks to another
> > such machine with its mtu also set to 1550 then they 
> > negotiate a maximum sized
> > packet based on 1550, and the problem hits me again. This is 
> > a web proxy 
> > and that problem occurs when there are two layers of proxy 
> > and one proxy talks to 
> > another. I really just need it to silently accept a packet some 
> > 32 bytes or so larger than the stated MTU.
> > 
> > I see no reason for the driver to not do what the em driver 
> > does and allow 
> > itself to receive any packet up to the MCLBYTES size.
> > 
> > We only hit this problem recently because the data interfaces on our
> > devices are usually em NICs and we only just recently started 
> > allowing the 
> > users to use the built in (on DELL 2950) bce interfaces for 
> > this purpose.
> > 
> 
> I'm not completely opposed to making such a change, but I don't want
> to make a default change in the driver's behavior that other people 
> may be depending upon (whether they are aware of it or not).  A
> tunable driver value could be the answer but I'm not entirely sure
> how it would fare in the hardware at the high end of MTU values such 
> as 9000.

Dave:

Internet etiquette demands being gracious in what you accept.
The default policy of FreeBSD is to accept such packets.
This is a really weird bug to track down.
Other drivers support it.

This isn't worth making a stand over, unless you're trying
to hold users of YOUR driver hostage.

-- 
- Alfred Perlstein


take II: Allocating AF constants for vendors.

2007-09-06 Thread Alfred Perlstein
* Alfred Perlstein <[EMAIL PROTECTED]> [070821 14:13] wrote:
> Hello all,
> 
> I would like to reserve about 64 entries for VENDOR specific address
> families in sys/socket.h.
> 
> I think this will allow vendors to comfortably use the array of
> address families without worrying about overlap with FreeBSD
> protocols.
> 
> If no one objects I plan to commit this in the next few days.
> 
> The format will be along the lines of:
> 
> AF_VENDOR0 -> AF_VENDOR63
> 
> Suggestions?

Sam asked that I provide some numbers for this proposal, and I have
them below.  In the meantime, another proposal I've floated is a
reservation system where FreeBSD would allocate only the even
numbers in the AF_ set of constants and leave the odd numbers for
vendors.

Q: "What if a vendor wants to then contribute code to FreeBSD?"
A: They should have asked FreeBSD to reserve a number, now they
   can allocate a FreeBSD one.

The numbers are specifically meant for internal address families.

Here are the numbers for simply bumping AF_MAX:

Here's what I have for sizing it up by 59 entries.


===
GDB commands to get sizes of structures related to AF_MAX:

printf "AF_MAX: %d\n", sizeof(((struct ifnet *)0)->if_afdata) / sizeof(void*)
printf "struct netexport: %d\n", sizeof(struct netexport)
printf "struct ifnet: %d\n", sizeof(struct ifnet)
printf "route.c:rt_tables: %d\n", sizeof(rt_tables)

===
Data from AF_MAX = 37 (FreeBSD-stable)

AF_MAX: 37
Kernel size:
/usr/src/sys/i386/compile/JUNIPER_6_2_SMP % size kernel.debugSMALLMAX
   text    data     bss     dec     hex filename
5964450  791752  367916 7124118  6cb496 kernel.debugSMALLMAX

/usr/src/sys/i386/compile/JUNIPER_6_2_SMP % gdb kernel.debugSMALLMAX 
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...
(gdb) printf "AF_MAX: %d\n", sizeof(((struct ifnet *)0)->if_afdata) / sizeof(void*)
AF_MAX: 37
(gdb) printf "struct netexport: %d\n", sizeof(struct netexport)
struct netexport: 316
(gdb) printf "struct ifnet: %d\n", sizeof(struct ifnet)
struct ifnet: 644
(gdb) printf "route.c:rt_tables: %d\n", sizeof(rt_tables)
route.c:rt_tables: 152
(gdb) 


===
Data from AF_MAX = 96 (FreeBSD-stable + 59 entries)

AF_MAX: 96
/usr/src/sys/i386/compile/JUNIPER_6_2_SMP % size kernel.debug
   text    data     bss     dec     hex filename
5964450  791752  368140 7124342  6cb576 kernel.debug

.(14:22:56)([EMAIL PROTECTED]) !!! SANDBOX UNSET!!! 
/usr/src/sys/i386/compile/JUNIPER_6_2_SMP % gdb kernel.debug
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...
(gdb) printf "AF_MAX: %d\n", sizeof(((struct ifnet *)0)->if_afdata) / sizeof(void*)
AF_MAX: 96
(gdb) printf "struct netexport: %d\n", sizeof(struct netexport)
struct netexport: 552
(gdb) printf "struct ifnet: %d\n", sizeof(struct ifnet)
struct ifnet: 880
(gdb) printf "route.c:rt_tables: %d\n", sizeof(rt_tables)
route.c:rt_tables: 388
(gdb) %


===
Summary of differences:

size:
   text    data     bss     dec     hex filename
5964450  791752  367916 7124118  6cb496 kernel.debugSMALLMAX
5964450  791752  368140 7124342  6cb576 kernel.debug

AF_MAX: 37
struct netexport: 316
struct ifnet: 644
route.c:rt_tables: 152

AF_MAX: 96
struct netexport: 552
struct ifnet: 880
route.c:rt_tables: 388

bss diff: bytes: 224 percent: <1%
dec diff: bytes: 224 percent: <1%
AF_MAX: difference: 59 percent: 62%
struct netexport: bytes: 236 percent: 43%
struct ifnet: bytes: 236 percent: 27%
route.c:rt_tables: bytes: 236 percent: 61%
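
The three structure deltas are all 236 bytes, which matches 59 extra
pointer slots at 4 bytes each on i386.  The sketch below redoes that
arithmetic; the percentages in the summary appear to be taken against
the new size, so the helper does the same.  Function names are made up
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes added to one pointer-per-address-family array when AF_MAX grows. */
size_t af_array_growth(int old_af_max, int new_af_max, size_t ptr_size)
{
    assert(new_af_max >= old_af_max);
    return (size_t)(new_af_max - old_af_max) * ptr_size;
}

/* Growth as a rounded percentage of the new structure size. */
unsigned growth_percent(size_t growth, size_t new_size)
{
    assert(new_size > 0);
    return (unsigned)((growth * 100 + new_size / 2) / new_size);
}
```

On a 64-bit platform the same 59 entries would cost 472 bytes per
array instead of 236.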

===
Unknown:  (I don't know how to get a static variable from gdb)

unknown: netatm/atm_if.c: -> atm_ifouttbl

===

Re: OS choice for an edge router

2007-09-06 Thread Alfred Perlstein

* Kirc Gover <[EMAIL PROTECTED]> [070906 11:10] wrote:
> We are in the stage of planning and research for a commercial development of 
> an edge router that will be based mostly on OpenSource software. I would like 
> to solicit for information and recommendation if FreeBSD is a suitable OS. 
> The router is expected to withstand forwarding of sustained traffic from 
> 10Mbps to 1Gbps and maybe more than that. Are there any known limitations of 
> FreeBSD in terms of architecture and performance? Can I just take out a 
> FreeBSD as is and put it with the hardware without any specific or major 
> refinements in its code? I'm  very much concerned with its capability in 
> forwarding heavy sustained traffic. Packet loss should be at minimum and 
> critical userland processes should keep working normally even under heavy load. 
> Are there any known specific limitations of FreeBSD? I have browsed through 
> the archives and found a lot of hangups, deadlocks and freeze issues. What is 
> the usual or minimum hardware requirement? Is soekris box enough, or dual 
> core or ASIC
>  based platforms? I'm aware that there are so many FreeBSD based routers and 
> network based devices in the market. Is this a way to go over realtime and 
> embedded OS such as VxWorks and others (mostly commercial) without putting 
> the licensing cost in picture? I really appreciate any help, suggestions and 
> recommendations. More power to FreeBSD!
>  
>  Thanks
>  Kirc

Kirc, do some research into Juniper routers. :)

1Gbps shouldn't be a problem for FreeBSD, however you may have to 
do some custom tweaks that I can't get into for obvious reasons.

I don't think a soekris would be sufficient for 1Gbps, however a
mid-range to high-end PC with good NICs and smart software should
suffice.

I think going with FreeBSD would be a great choice.

-- 
- Alfred Perlstein


(forw) Re: Allocating AF constants for vendors.

2007-09-05 Thread Alfred Perlstein
Bruce, I haven't heard back from you on this.  Can you please comment?

I'd like to add the policy to the header.

- Forwarded message from Alfred Perlstein <[EMAIL PROTECTED]> -----

From: Alfred Perlstein <[EMAIL PROTECTED]>
To: "Bruce M. Simpson" <[EMAIL PROTECTED]>
Cc: Max Laier <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Subject: Re: Allocating AF constants for vendors.
Date: Tue, 4 Sep 2007 05:42:24 -0700
Message-ID: <[EMAIL PROTECTED]>
User-Agent: Mutt/1.4.2.3i
Sender: [EMAIL PROTECTED]

* Bruce M. Simpson <[EMAIL PROTECTED]> [070904 03:08] wrote:
> >As you can see we are deferring the "bloat".
> >Does that make sense?
> >  
> 
> I follow but it still doesn't really make sense.
> 
> Granted, you are deferring the growth of arrays sized off AF_MAX but 
> only ever by 1 slot.
> What if Vendor Z wants to add 25 entries at once?

Then as long as they allocate odd numbered entries they should
be fine.  FreeBSD's AF_MAX does not need to change to accommodate
a vendor, it only has to restrict itself to even numbered slots.

> We would also be tying ourselves down to the notion of a vendor in any 
> AF_ allocation. Is this an avenue that people are happy to pursue?

Yes, until the "horrific" problem of the statically sized arrays
is "fixed".  Then the allocation policy can change.


-- 
- Alfred Perlstein

- End forwarded message -

-- 
- Alfred Perlstein


Re: Allocating AF constants for vendors.

2007-09-04 Thread Alfred Perlstein
* Randall Stewart <[EMAIL PROTECTED]> [070904 13:22] wrote:
> Alfred Perlstein wrote:
> >* Bruce M. Simpson <[EMAIL PROTECTED]> [070904 03:08] wrote:
> >
> >>>As you can see we are deferring the "bloat".
> >>>Does that make sense?
> >>>
> >>
> >>I follow but it still doesn't really make sense.
> >>
> >>Granted, you are deferring the growth of arrays sized off AF_MAX but 
> >>only ever by 1 slot.
> >>What if Vendor Z wants to add 25 entries at once?
> >
> >
> >Then as long as they allocate odd numbered entries they should
> >be fine.  FreeBSD's AF_MAX does not need to change to accommodate
> >a vendor, it only has to restrict itself to even numbered slots.
> >
> >
> >>We would also be tying ourselves down to the notion of a vendor in any 
> >>AF_ allocation. Is this an avenue that people are happy to pursue?
> >
> >
> >Yes, until the "horrific" problem of the statically sized arrays
> >is "fixed".  Then the allocation policy can change.
> >
> >
> So basically in this scheme we only have to "stumble" across an
> additional slot when we add a new one to FreeBSD.. i.e. some
> random vendor may assign 50 slots (in odd numbers) but FreeBSD
> would not see the growth until really 2 new AF_XXX's are added.
> Then you would have to bump it by 3, to cover the two
> new ones (reserving the vendor specific slots and thus causing
> allocations of unused things).

YES!  Exactly.

> 
> This seems like a reasonable compromise to me... I can't imagine
> where we would need to add a lot of new AF_XXX's.. of course
> maybe I just lack imagination :-D

Well, Freebsd or 5 added bluetooth, and freebsd 7 has some IEEE thing
added... sooo... the array is growing, but slowly.

-- 
- Alfred Perlstein


Re: Allocating AF constants for vendors.

2007-09-04 Thread Alfred Perlstein
* Bruce M. Simpson <[EMAIL PROTECTED]> [070904 03:08] wrote:
> >As you can see we are deferring the "bloat".
> >Does that make sense?
> >  
> 
> I follow but it still doesn't really make sense.
> 
> Granted, you are deferring the growth of arrays sized off AF_MAX but 
> only ever by 1 slot.
> What if Vendor Z wants to add 25 entries at once?

Then as long as they allocate odd numbered entries they should
be fine.  FreeBSD's AF_MAX does not need to change to accommodate
a vendor, it only has to restrict itself to even numbered slots.

> We would also be tying ourselves down to the notion of a vendor in any 
> AF_ allocation. Is this an avenue that people are happy to pursue?

Yes, until the "horrific" problem of the statically sized arrays
is "fixed".  Then the allocation policy can change.


-- 
- Alfred Perlstein


Re: Allocating AF constants for vendors.

2007-09-03 Thread Alfred Perlstein
* Bruce M. Simpson <[EMAIL PROTECTED]> [070903 07:44] wrote:
> Alfred Perlstein wrote:
> >Ok, I'm not really sure what to do here.  At Juniper we have approx
> >20 additional entries for AF_ constants.  We also have theoretical
> >but not practical "problems" with spareness and utility of this
> >list, meaning we have plenty of arrays in our version of ifnets and
> >route entries that are also "bloated" as well.
> >  
> 
> Can you merge them into the list in such a way that AF_MAX does not need 
> to slide forward?
> Or do they need to be referenced from within the kernel tree itself?

They are referenced inside the kernel.

> Prevention of code bloat is better than the cure.  Not having the code 
> in front of me I couldn't say for sure if we're talking about a dozen 
> bytes or several pages potentially being wasted, so it is impossible to 
> judge.

Well, for the most part it's going to be something like 32*sizeof(void*)
so 128 or 256 bytes depending on arch.

> One of my concerns is that we have ifnet.if_afdata, we're not really 
> using it, it makes sense to use it for some things.

I'll have to look into this.

> Help from big companies as well as little folks is always appreciated, 
> providing we can reach consensus.

YES! :)

> >Otherwise one other policy would be to specify an allocation
> >policy such that new AF_ constants are allocated only for even
> >numbers where odd numbers are left to vendors.
> >
> >This would slow the "bloat" and still provide vendors with something
> >useful.
> >
> >How does that sound?
> >  
> 
> EPARSE? I don't follow this at all.

Ok, let's say we guarantee that going forward, all odd AF_ constants 
are vendor reserved.

So whenever FreeBSD allocates an AF constant, it should be even,
vendors can use odd.

That means that, from socket.h:

#define AF_ARP          35
#define AF_BLUETOOTH    36      /* Bluetooth sockets */
#define AF_IEEE80211    37      /* IEEE 802.11 protocol */
#define AF_MAX          38

Now let's say FreeBSD wants to add a AF constant, the next one to allocate
would be 38, so we have:

#define AF_ARP          35
#define AF_BLUETOOTH    36      /* Bluetooth sockets */
#define AF_IEEE80211    37      /* IEEE 802.11 protocol */
#define AF_NEWPROTO1    38      /* some awesome new protocol! */
#define AF_MAX          39

Ok, well that doesn't explain it much, however, shortly thereafter we
allocate another AF constant in FreeBSD, the list now looks like:

#define AF_ARP          35
#define AF_BLUETOOTH    36      /* Bluetooth sockets */
#define AF_IEEE80211    37      /* IEEE 802.11 protocol */
#define AF_NEWPROTO1    38      /* some awesome new protocol! */
#define AF_VENDOR0      39      /* reserved for vendors. */
#define AF_NEWPROTO2    40      /* some awesome new protocol! */
#define AF_MAX          41

Soon another protocol is added:

#define AF_ARP          35
#define AF_BLUETOOTH    36      /* Bluetooth sockets */
#define AF_IEEE80211    37      /* IEEE 802.11 protocol */
#define AF_NEWPROTO1    38      /* some awesome new protocol! */
#define AF_VENDOR0      39      /* reserved for vendors. */
#define AF_NEWPROTO2    40      /* some awesome new protocol! */
#define AF_VENDOR1      41      /* reserved for vendors. */
#define AF_NEWPROTO3    42      /* some awesome new protocol! */
#define AF_MAX          43

As you can see we are deferring the "bloat".

Does that make sense?
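
The allocation rule above can be written down in a few lines of C.
This is only a sketch of the proposed policy, with made-up function
names; it applies to new allocations going forward, not to the odd
numbers FreeBSD already uses (AF_ARP, AF_IEEE80211).

```c
#include <assert.h>

/* Under the proposed policy, odd numbers are left to vendors. */
int af_is_vendor_slot(int af)
{
    assert(af >= 0);
    return (af % 2) != 0;
}

/* Next number FreeBSD itself may allocate, given the highest in use. */
int af_next_freebsd(int highest_used)
{
    int next = highest_used + 1;

    return af_is_vendor_slot(next) ? next + 1 : next;
}
```

Starting from 37 this yields 38, then 40, then 42, with 39 and 41 left
for vendors, which is the sequence in the listings above.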

-- 
- Alfred Perlstein


Re: Allocating AF constants for vendors.

2007-09-02 Thread Alfred Perlstein
* Bruce M. Simpson <[EMAIL PROTECTED]> [070822 07:33] wrote:
> I second Max. If you are going to introduce a bunch of AF_* constants
> into the tree you have to be very careful as AF_MAX is used to size
> arrays and figure out how many radix trie heads to allocate.

Ok, I'm not really sure what to do here.  At Juniper we have approx
20 additional entries for AF_ constants.  We also have theoretical
but not practical "problems" with sparseness and utility of this
list, meaning we have plenty of arrays in our version of ifnets and
route entries that are also "bloated" as well.

We happen not to find it a problem.

Perhaps if $BIG_ROUTER_COMPANY is not concerned about this then
that might be convincing enough to let it go?

Perhaps it would help if I added that I intend to share the code
to dynamically allocate the data if we ever write it ourselves?

Otherwise one other policy would be to specify an allocation
policy such that new AF_ constants are allocated only for even
numbers where odd numbers are left to vendors.

This would slow the "bloat" and still provide vendors with something
useful.

How does that sound?

-Alfred





>
> It could be argued this wastes a bunch of CPU time and memory, though I
> speculate 'not much' at the moment; I am just a bit concerned that we
> have ifnet->if_afdata which is also sized based on AF_MAX, 37, even
> though most of the protocols in it are never attached to ifnets.
>
> The only domain I've seen which really uses if_afdata is PF_INET6.
> PF_INET does not use it at all. In my opinion, there are structures
> per-family per-ifnet which really belong hung-off ifnet on a 1:1 basis
> and would simplify some of the lazy allocations we have further down in
> the stack.
>
> If AF_MAX increases significantly so will wasted memory. If you are
> going to make any significant changes here, please consider moving
> this stuff to a more dynamic method of allocation.
>
> On the other hand, if you don't need to reference these constants in the
> kernel at all, and they will all exist beyond AF_MAX, then you can
> disregard what I've said and append them to the rest of the list.
>
> That is pretty much what happens for the libpcap/bpf DLT constants
> (which are not an exact analogue of the AF constants - we don't allocate
> other, larger kernel structures based on their value).
>
> regards,
> BMS

--
- Alfred Perlstein


Re: Allocating AF constants for vendors.

2007-09-01 Thread Alfred Perlstein
* Max Laier <[EMAIL PROTECTED]> [070822 14:38] wrote:
> On Wednesday 22 August 2007, Bruce M. Simpson wrote:
> [...]
> > On the other hand, if you don't need to reference these constants in
> > the kernel at all, and they will all exist beyond AF_MAX, then you can
> > disregard what I've said and append them to the rest of the list.
> 
> Please make sure to leave a bit of space between AF_MAX and your constants 
> so we could still grow AF_MAX if the need should ever arise.

Hmm, that could work, but I think we have the same problem, we depend on AF_MAX.

I could look into a more dynamic way of allocating... possibly.

-Alfred


Re: Allocating AF constants for vendors.

2007-08-21 Thread Alfred Perlstein
I trimmed the sender of this because I got it in private mail.  That
said, I thought it was a good bunch of questions, so I am replying
to it here.

> 64?  are you intending to bump AF_MAX or allocate them sequentially such 
> that adding another AF will require AF_MAX to grow a lot?
> 
> In general this seems like a bad idea to me.  I suggest you need to 
> (publicly) explain what you are doing and why this is a good idea.

The goal here is to allow vendors to add their own constants without
worrying about conflicting with FreeBSD constants.  It will allow
vendors to maintain some semblance of binary compatibility against
FreeBSD.

If you look at libpcap:

 http://cvs.tcpdump.org/cgi-bin/cvsweb/libpcap/pcap/bpf.h?rev=1.15

You can see that Juniper has asked for some number of reserved
"families", in our case, I think it would be a bit greedy to
grow the list _just_ for Juniper, so I suggested something that
would work for every vendor.

As far as implementation details, either one works for me, do you
have any particular preference?

Other than the actual delta, will this have any noticeable negative
impact that you can see?

-- 
- Alfred Perlstein


Allocating AF constants for vendors.

2007-08-21 Thread Alfred Perlstein
Hello all,

I would like to reserve about 64 entries for VENDOR specific address
families in sys/socket.h.

I think this will allow vendors to comfortably use the array of
address families without worrying about overlap with FreeBSD
protocols.

If no one objects I plan to commit this in the next few days.

The format will be along the lines of:

AF_VENDOR0 -> AF_VENDOR63
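
A rough sketch of what such a reservation could look like in a header follows. The base value of 64, the helper macro AF_VENDOR(n) and the function af_is_vendor() are assumptions made purely for illustration; the message above only fixes the names AF_VENDOR0 through AF_VENDOR63 and the count of 64.

```c
#include <assert.h>

/* Hypothetical base: the first number handed to vendors is not decided above. */
#define AF_VENDOR_BASE   64
#define AF_VENDOR_COUNT  64

/* AF_VENDOR(n) names the n-th vendor slot, for n in [0, AF_VENDOR_COUNT). */
#define AF_VENDOR(n)     (AF_VENDOR_BASE + (n))
#define AF_VENDOR0       AF_VENDOR(0)
#define AF_VENDOR63      AF_VENDOR(63)

/* Return 1 if af lies inside the reserved vendor range, 0 otherwise. */
static int
af_is_vendor(int af)
{
    assert(AF_VENDOR63 - AF_VENDOR0 + 1 == AF_VENDOR_COUNT);
    return (af >= AF_VENDOR0 && af <= AF_VENDOR63);
}
```

A vendor would then pick one slot, for example AF_VENDOR(3), and use it wherever an address family number is expected, without colliding with families FreeBSD adds later.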

Suggestions?

thank you,
-- 
- Alfred Perlstein


Re: freebsd nfs version4 server

2007-06-15 Thread Alfred Perlstein
* Dave <[EMAIL PROTECTED]> [070615 19:06] wrote:
> Hello,
>Firewalling nfs i was reading some client docs and i found out that 
> FreeBSD has client support for the nfs v4. I was wondering if FreeBSD 6.2 
> could act as an nfs v4 server?

There's a patchset from Rick Maclem(sp?) that might do it.

-Alfred


Re: Firewalling NFS

2007-06-15 Thread Alfred Perlstein
* Jeremie Le Hen <[EMAIL PROTECTED]> [070615 01:07] wrote:
> Hi,
> 
> It appears nearly impossible to firewall a NFS server on FreeBSD.

It would be nearly impossible if one didn't know much about NFS.

Care to rephrase your assertion?

> The reason is that NFS related daemons use RPC, which means they
> don't bind to a deterministic port.  Only mountd(8) can be requested to
> bind to a specific port or fail with the -p command-line switch.
> Is there any reason other than "no one has needed this yet" why this
> option is not available for nfsd(8), rpc.lockd(8) and rpc.statd(8)?

this is wrong, wrong and more wrong.

-- 
- Alfred Perlstein


Re: New driver coming soon.

2007-06-03 Thread Alfred Perlstein
That's typically left to the driver author's discretion, so go at it.

* Jack Vogel <[EMAIL PROTECTED]> [070530 17:53] wrote:
> I wanted to let everyone know that I will soon have a
> new 10G driver to add to the tree. It is a PCI Express
> MSI/X adapter, I would like to call this driver 'ix' rather
> than follow Linux who are calling it 'ixgbe'. It is not
> backwardly compatible with ixgb.   Any objections
> to the name? It would be nice to get this in before
> 7 becomes a RELEASE, what time frame do I
> have for that?
> 
> Cheers,
> 
> Jack

-- 
- Alfred Perlstein


Re: NAT Traversal Patches ...

2007-05-11 Thread Alfred Perlstein
Matthew, can you provide links to the patches and surrounding
discussion.  It may just be a matter of integration manpower...

* Matthew Grooms <[EMAIL PROTECTED]> [070511 08:08] wrote:
> 
> All,
> 
>  I understand that FreeBSD is a volunteer project, but does anyone
> have any information regarding the status of the IPsec NAT Traversal
> patches and their inclusion with FreeBSD? I have seen them floating
> around this list for a few years now. At one point, there was an
> objection that concerned a possible legal issue related to patents. This
> can't be too much of a road block as Linux, OpenBSD and NetBSD all
> include support for NATT in official stable kernel sources. Fedora Core
> 6 even has the feature enabled by default in the generic kernel. Another
> objection I have seen was related to the patch only offering support for
> the KAME stack. But the most recent patch set also offers support for
> the Fast IPsec stack as well.
> 
>  Is the patch lacking sponsorship by a FreeBSD developer sponsor
> since the author does not have commit access? Maybe a developer looking
> at the patch is just short on time at the moment? If so, is there
> another developer that could maybe help out? Is there a technical reason
> why the patches have not been committed? If so, I don't think the
> author is aware so a little communication is required?
> 
>  Lastly, is there anything the community can do to help out? Maybe
> donating to a FreeBSD Foundation project that sponsors IPsec related
> work?
> 
> Thanks,
> 
> -Matthew

-- 
- Alfred Perlstein


Re: setsockopt() can not remove the accept filter

2005-06-11 Thread Alfred Perlstein
"size (%d vs expected %d)", len, sizeof(afa));
> - printf("ok 8 - setsockopt\n");
> + printf("ok 9 - setsockopt\n");
> 
>   /*
> -  * Step 8: After setsockopt().  Should succeed and identify
> +  * Step 9: After setsockopt().  Should succeed and identify
>* ACCF_NAME.
>*/
>   bzero(&afa, sizeof(afa));
>   len = sizeof(afa);
>   ret = getsockopt(lso, SOL_SOCKET, SO_ACCEPTFILTER, &afa, &len);
>   if (ret != 0)
> - errx(-1, "not ok 9 - getsockopt() after listen() setsockopt() "
> + errx(-1, "not ok 10 - getsockopt() after listen() setsockopt() "
>   "failed with %d (%s)", errno, strerror(errno));
>   if (len != sizeof(afa))
> - errx(-1, "not ok 9 - getsockopt() after setsockopet()  after "
> + errx(-1, "not ok 10 - getsockopt() after setsockopet()  after "
>   "listen() returned wrong size (got %d expected %d)", len,
>   sizeof(afa));
>   if (strcmp(afa.af_name, ACCF_NAME) != 0)
> - errx(-1, "not ok 9 - getsockopt() after setsockopt() after "
> + errx(-1, "not ok 10 - getsockopt() after setsockopt() after "
>   "listen() mismatch (got %s expected %s)", afa.af_name,
>   ACCF_NAME);
> - printf("ok 9 - getsockopt\n");
> + printf("ok 10 - getsockopt\n");
> +
> + /*
> +  * Step 10: Remove accept filter.  After removing the accept filter
> +  * getsockopt() should fail with EINVAL.
> +  */
> + ret = setsockopt(lso, SOL_SOCKET, SO_ACCEPTFILTER, NULL, 0);
> + if (ret != 0)
> + errx(-1, "not ok 11 - setsockopt() after listen() "
> + "failed with %d (%s)", errno, strerror(errno));
> + bzero(&afa, sizeof(afa));
> + len = sizeof(afa);
> + ret = getsockopt(lso, SOL_SOCKET, SO_ACCEPTFILTER, &afa, &len);
> + if (ret == 0)
> + errx(-1, "not ok 11 - getsockopt() after removing "
> + "the accept filter returns valid accept filter %s",
> + afa.af_name);
> + if (errno != EINVAL)
> + errx(-1, "not ok 11 - getsockopt() after removing the accept"
> + "filter failed with %d (%s)", errno, strerror(errno));
> + printf("ok 11 - setsockopt\n");
> 
>   close(lso);
>   return (0);
> %%%
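
The removal step the patch tests can be pulled out into two small helpers, sketched here. The function names remove_accept_filter() and accept_filter_present() are hypothetical; the setsockopt()/getsockopt() calls and the EINVAL expectation follow the quoted test. The #ifdef fallback is an assumption added so that the sketch also compiles on systems without SO_ACCEPTFILTER.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Detach the accept filter from a listening socket: 0 on success, -1 on error. */
static int
remove_accept_filter(int lso)
{
#ifdef SO_ACCEPTFILTER
    /* A NULL argument with zero length asks the kernel to drop the filter. */
    return (setsockopt(lso, SOL_SOCKET, SO_ACCEPTFILTER, NULL, 0));
#else
    (void)lso;
    errno = ENOTSUP;    /* assumption: report "unsupported" on other systems */
    return (-1);
#endif
}

/* Return 1 if a filter is attached, 0 if none (EINVAL), -1 on any other error. */
static int
accept_filter_present(int lso)
{
#ifdef SO_ACCEPTFILTER
    struct accept_filter_arg afa;
    socklen_t len = sizeof(afa);

    memset(&afa, 0, sizeof(afa));
    if (getsockopt(lso, SOL_SOCKET, SO_ACCEPTFILTER, &afa, &len) == 0)
        return (1);
    return (errno == EINVAL ? 0 : -1);
#else
    (void)lso;
    errno = ENOTSUP;
    return (-1);
#endif
}
```

On a system with accept filters, calling remove_accept_filter(lso) on a listening socket and then accept_filter_present(lso) should yield 0, which is the condition step 10 of the test checks.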
> 
> -- 
> Maxim Konovalov

-- 
- Alfred Perlstein
- email: [EMAIL PROTECTED] cell: 408-480-4684


Re: Linux compatible rpc.lockd

2004-11-25 Thread Alfred Perlstein
* Bruce M Simpson <[EMAIL PROTECTED]> [041125 13:53] wrote:
> On Thu, Nov 25, 2004 at 08:18:12PM +0100, Björn Grönvall wrote:
> > I have made a patch to address PR kern/56461, in short the patch
> > provides two different options to be compatible with Linux lockd
> > implementations. It can also serve as a basis for a future more robust
> > rpc.lockd.
> 
> Thank you for this. I looked at this around 8 months ago but abandoned
> further work on it because the approach I was taking required that
> nfs be refactored to use the nmount() API, and because I am not currently
> using NFS. It looks as though the two options implemented here helps to
> address the problems I was having with making sure Linux servers got
> the right lock cookie response.
> 
> Have you tested this in production and does it work well? If so I believe
> it should be committed, but I'd defer to Alfred for further review.

It looks non-invasive enough to be safe.  Please see if you can 
get a test run and commit it.  I'm in the hospital and not able to
do stuff.

-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


(forw) Re: kern/72396: Incorrect network accounting with aliases.

2004-10-06 Thread Alfred Perlstein
I submitted a PR with a patch, but I think there may be a better
fix, any ideas?

-Alfred

- Forwarded message from [EMAIL PROTECTED] -

From: [EMAIL PROTECTED]
Reply-To: [EMAIL PROTECTED], [EMAIL PROTECTED]
To: Alfred Perlstein <[EMAIL PROTECTED]>
Subject: Re: kern/72396: Incorrect network accounting with aliases.
Date: Wed, 6 Oct 2004 17:50:29 GMT
Message-Id: <[EMAIL PROTECTED]>

Thank you very much for your problem report.
It has the internal identification `kern/72396'.
The individual assigned to look at your
report is: freebsd-bugs. 

You can access the state of your problem report at any time
via this link:

http://www.freebsd.org/cgi/query-pr.cgi?pr=72396

>Category:   kern
>Responsible:freebsd-bugs
>Synopsis:   Incorrect network accounting with aliases.
>Arrival-Date:   Wed Oct 06 17:50:29 GMT 2004

- End forwarded message -

-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


Re: aio patch for review.

2004-09-30 Thread Alfred Perlstein
* Alan Cox <[EMAIL PROTECTED]> [040930 21:19] wrote:
> On Thu, Sep 30, 2004 at 02:18:14AM -0700, Alfred Perlstein wrote:
> > properly cover the socket buffer for operations that need locking.
> > 
> 
> Just to be clear, your point is that soreadable() and sowriteable()
> should be performed with the corresponding socket buffer locked.
> Correct?  If so, yes, please go ahead and commit it.

Yup.

thank you,
-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


aio patch for review.

2004-09-30 Thread Alfred Perlstein
properly cover the socket buffer for operations that need locking.

please review.


Index: vfs_aio.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_aio.c,v
retrieving revision 1.176
diff -u -r1.176 vfs_aio.c
--- vfs_aio.c   23 Sep 2004 14:45:04 -  1.176
+++ vfs_aio.c   30 Sep 2004 09:15:10 -
@@ -1297,6 +1297,7 @@
struct kevent kev;
struct kqueue *kq;
struct file *kq_fp;
+   struct sockbuf *sb;
 
aiocbe = uma_zalloc(aiocb_zone, M_WAITOK);
aiocbe->inputcharge = 0;
@@ -1451,29 +1452,28 @@
 * If it is not ready for io, then queue the aiocbe on the
 * socket, and set the flags so we get a call when sbnotify()
 * happens.
+*
+* Note if opcode is neither LIO_WRITE nor LIO_READ we lock
+* and unlock the snd sockbuf for no reason.
 */
so = fp->f_data;
+   sb = (opcode == LIO_READ) ? &so->so_rcv : &so->so_snd;
+   SOCKBUF_LOCK(sb);
s = splnet();
if (((opcode == LIO_READ) && (!soreadable(so))) || ((opcode ==
LIO_WRITE) && (!sowriteable(so)))) {
TAILQ_INSERT_TAIL(&so->so_aiojobq, aiocbe, list);
TAILQ_INSERT_TAIL(&ki->kaio_sockqueue, aiocbe, plist);
-   if (opcode == LIO_READ) {
-   SOCKBUF_LOCK(&so->so_rcv);
-   so->so_rcv.sb_flags |= SB_AIO;
-   SOCKBUF_UNLOCK(&so->so_rcv);
-   } else {
-   SOCKBUF_LOCK(&so->so_snd);
-   so->so_snd.sb_flags |= SB_AIO;
-   SOCKBUF_UNLOCK(&so->so_snd);
-   }
+   sb->sb_flags |= SB_AIO;
aiocbe->jobstate = JOBST_JOBQGLOBAL; /* XXX */
ki->kaio_queue_count++;
num_queue_count++;
+   SOCKBUF_UNLOCK(sb);
splx(s);
error = 0;
goto done;
    }
+   SOCKBUF_UNLOCK(sb);
splx(s);
}
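
The point of the patch, doing the readiness check and the flag update under one lock acquisition, can be shown with a userland analogy. This is only a sketch: struct demo_sockbuf, DEMO_SB_AIO and both functions are hypothetical stand-ins, and a pthread mutex plays the role of the sockbuf lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_SB_AIO 0x1         /* stand-in for SB_AIO */

struct demo_sockbuf {
    pthread_mutex_t lock;       /* stand-in for the sockbuf mutex */
    int ready;                  /* stand-in for soreadable()/sowriteable() */
    int flags;
};

static void
demo_sockbuf_init(struct demo_sockbuf *sb)
{
    assert(sb != NULL);
    pthread_mutex_init(&sb->lock, NULL);
    sb->ready = 0;
    sb->flags = 0;
}

/*
 * Test readiness and, if the buffer is not ready, set the AIO flag
 * without dropping the lock in between, so the state cannot change
 * between the check and the update.
 * Returns 1 if the request was queued, 0 if the buffer was ready.
 */
static int
demo_queue_if_not_ready(struct demo_sockbuf *sb)
{
    int queued = 0;

    pthread_mutex_lock(&sb->lock);
    if (!sb->ready) {
        sb->flags |= DEMO_SB_AIO;
        queued = 1;
    }
    pthread_mutex_unlock(&sb->lock);
    return (queued);
}
```

The pre-patch code corresponds to checking sb->ready with no lock held and taking the lock only around the flags update, which leaves a window between the two.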
 
-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


Re: kern/56461: FreeBSD client rpc.lockd incompatible with Linux server rpc.lockd

2004-06-18 Thread Alfred Perlstein
* Barney Wolff <[EMAIL PROTECTED]> [040618 14:09] wrote:
> On Fri, Jun 18, 2004 at 10:51:21AM -0700, Alfred Perlstein wrote:
> > 
> > *Sigh* make it a sysctl, but can someone please lay the smack
> > down on the linuxiots and have them fix their crap?
> > 
> > * Bruce M Simpson <[EMAIL PROTECTED]> [040618 04:50] wrote:
> > > 
> > > Linux NFS advisory locks are broken and incompatible with the rest
> > > of the world. FreeBSD 5.x in particular uses BSD/OS derived NFS code
> > > and thus is affected. FreeBSD 4.x does not implement client-side NFS
> > > advisory locks.
> > > 
> > > This problem is also documented as existing for MacOS X, IRIX and BSD/OS:
> > > http://www.netsys.com/bsdi-users/2002-04/msg00036.html
> > > http://www.uwsg.iu.edu/hypermail/linux/kernel/0311.0/0498.html
> > > http://lists.freebsd.org/pipermail/freebsd-hackers/2003-July/001833.html
> > > http://lists.freebsd.org/pipermail/freebsd-hackers/2003-April/000592.html
> > > 
> > > The patch provided in the PR is verified to solve the problem, but
> > > it would be good to make this functionality optional at run-time,
> > > as many people are likely to be using Linux NFS shares read/write
> > > with advisory locks.
> 
> Pardon an ignorant question, but what happens to unfortunate people who
> have to talk to both Linux and non-quirky servers at the same time?  Is
> there a way to detect what flavor of server you're talking to and adjust
> accordingly?  That would be far better than a sysctl.

Mount option?  Can we do that these days?

-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


Re: kern/56461: FreeBSD client rpc.lockd incompatible with Linux server rpc.lockd

2004-06-18 Thread Alfred Perlstein
This fucking sucks.

*Sigh* make it a sysctl, but can someone please lay the smack
down on the linuxiots and have them fix their crap?



* Bruce M Simpson <[EMAIL PROTECTED]> [040618 04:50] wrote:
> I've attached my thoughts on this issue. I haven't gone ahead and
> committed the fix in the PR as it makes us just as braindead as Linux,
> but it would be good to be able to have this in GENERIC so that it
> can be enabled in those situations where it's needed.
> 
> Regards,
> BMS

> Synopsis:
> 
> Linux NFS advisory locks are broken and incompatible with the rest
> of the world. FreeBSD 5.x in particular uses BSD/OS derived NFS code
> and thus is affected. FreeBSD 4.x does not implement client-side NFS
> advisory locks.
> 
> This problem is also documented as existing for MacOS X, IRIX and BSD/OS:
> http://www.netsys.com/bsdi-users/2002-04/msg00036.html
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0311.0/0498.html
> http://lists.freebsd.org/pipermail/freebsd-hackers/2003-July/001833.html
> http://lists.freebsd.org/pipermail/freebsd-hackers/2003-April/000592.html
> 
> The patch provided in the PR is verified to solve the problem, but
> it would be good to make this functionality optional at run-time,
> as many people are likely to be using Linux NFS shares read/write
> with advisory locks.
> 
> Walkthrough:
> 
> The addition of pid_start to struct lockd_msg_ident is what triggered
> this problem. The offending member is referenced by the NFS code, and
> rpc.lockd itself.
> 
> The kernel interface code for rpc.lockd resides in
> src/usr.sbin/rpc.lockd/kern.c.
> 
> LOCKD_MSG is what gets passed from the kernel to rpc.lockd via the
> named pipe /var/run/lock.
> 
> NFSCLNT_LOCKDANS is used by lockd to send a response back. struct
> lockd_ans is the structure passed via this syscall. The kernel code
> for this is in nfslockdans(), in src/sys/nfsclient/nfs_lock.c.
> 
> Proposed solution:
> 
> Actual NLM request conversion to/from the kernel happens in rpc.lockd;
> there are several places in kern.c, notably test_request() and
> lock_request(), which reference struct nlm4_testargs, struct nlm_testargs,
> struct nlm_lockargs, and struct nlm4_lockargs.
> These are defined in src/include/rpcsvc/nlm_prot.x.
> 
> XXX Are the lockd cookies different from the regular NFS filehandles?
> 
>   arg4.cookie.n_bytes = (char *)&msg->lm_msg_ident;
>   arg4.cookie.n_len = sizeof(msg->lm_msg_ident);
> 
> There's no need to change this structure, just the number of bytes
> provided by it; the lm_msg_ident structure needs to change if we're
> doing Linux compatibility, and is probably best served by adding
> a sysctl to keep track of whether we're in this mode or not.
> 
> So embedding a union of structs in lm_msg_ident is probably the way to go,
> and taking the sizeof() the embedded struct as appropriate.
> 
> I would suggest adding a sysctl to the tree: vfs.nfs.pid_start_locks,
> "Use process start time as well as PID to differentiate client-side NFS locks".
> This should be referenced from nfslockdans() as per the original patch
> to check if the timercmp comparison should be skipped.
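
The union-of-structs idea in the proposal might be sketched as follows. All type and function names here are hypothetical, the field layout is only a guess at what lm_msg_ident carries, and the integer argument stands in for the proposed vfs.nfs.pid_start_locks sysctl.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/time.h>

/* Hypothetical short form: identify the lock owner by PID only. */
struct demo_ident_pid {
    pid_t pid;
    int msg_seq;
};

/* Hypothetical full form: PID plus process start time. */
struct demo_ident_pid_start {
    pid_t pid;
    struct timeval pid_start;
    int msg_seq;
};

/* Both layouts share storage; the mode decides how many bytes are sent. */
union demo_msg_ident {
    struct demo_ident_pid compat;
    struct demo_ident_pid_start full;
};

/* Cookie length to hand to the NLM request for the given mode (0 or 1). */
static size_t
demo_cookie_len(int pid_start_locks)
{
    assert(pid_start_locks == 0 || pid_start_locks == 1);
    return (pid_start_locks ? sizeof(struct demo_ident_pid_start)
        : sizeof(struct demo_ident_pid));
}
```

In kern.c terms, arg4.cookie.n_bytes would keep pointing at the same structure, and only arg4.cookie.n_len would change, to something like demo_cookie_len(mode).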




-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


Re: "netstat -m" and sendfile(2) statistics in STABLE

2004-06-17 Thread Alfred Perlstein
* Mike Silbersack <[EMAIL PROTECTED]> [040617 23:20] wrote:
> 
> On Fri, 18 Jun 2004, Igor Sysoev wrote:
> 
> >Hi,
> >
> >I read objections in cvs-all@ about netstat's output after MFC
> >of sendfile(2) statistics.
> >
> >How about "netstat -ms" ?
> >
> >Right now this switch combination is treated as simple "-m" in both -STABLE
> >and -CURRENT.
> >
> >
> >Igor Sysoev
> >http://sysoev.ru/en/
> 
> I would prefer that sfbufs statistics either be kept in netstat -m, OR 
> added to an entirely different program (perhaps vmstat).  Making yet 
> another netstat flag just because we're scared of confusing users is a 
> noble compromise, but will in the end just make things more confusing.

I was going to suggest vmstat now that sfbufs are used for so many
other things than just "sendfile bufs".

-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


Re: "netstat -m" and sendfile(2) statistics in STABLE

2004-06-17 Thread Alfred Perlstein
* Igor Sysoev <[EMAIL PROTECTED]> [040617 22:52] wrote:
> Hi,
> 
> I read objections in cvs-all@ about netstat's output after MFC
> of sendfile(2) statistics.
> 
> How about "netstat -ms" ?
> 
> Right now this switch combination is treated as simple "-m" in both -STABLE
> and -CURRENT.

I would love to see the sendfile stats moved to '-s'.

If that's what you're proposing, then yes. :)

Oh last of the nits: changes to userland output make things like
examples from documentation out of date which can obfuscate things
and/or ruin docs for a release.

-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


thanks all (was: Re: crossover between gigE?)

2003-12-20 Thread Alfred Perlstein
I had a cat5e cable, but either:
a) the box needed to reboot
b) 4.9 has a problem whereas 4-stable post 4.9 is ok with em0

I dunno, but it's working with a standard cat5e cable now after
the upgrade.

* Michael Sierchio <[EMAIL PROTECTED]> [031220 14:13] wrote:
> Alfred Perlstein wrote:
> >Any suggestion of the kind of cable one should look for at Frys
> >to run between two gigE card (intel em0) to function as a crossover?
> >
> >
> 
> I was under the impression that copper gigE cards were auto-sensing
> for polarity and it didn't matter whether you use a straight or crossover.
> 
> -- 
> 
> "Well," Brahma said, "even after ten thousand explanations, a fool is no
>  wiser, but an intelligent man requires only two thousand five hundred."
> - The Mahabharata

-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


crossover between gigE?

2003-12-20 Thread Alfred Perlstein
Any suggestion of the kind of cable one should look for at Frys
to run between two gigE card (intel em0) to function as a crossover?


-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684


Re: misc/44361: possible raw socket bug

2003-01-18 Thread Alfred Perlstein
It appears that we expect the ip_len and ip_off fields to be sent
in host byte order as the stack will fix it to network byte order
in ip_output.

Is this a bug or feature? :)
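
For someone filling in a raw IP header by hand, the practical consequence might be sketched like this. The helper name demo_raw_ip_field() and its stack_swaps parameter are hypothetical; whether a given stack does the conversion itself is exactly the question raised above, so the caller has to know which case applies.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>

/*
 * Value to store in ip_len or ip_off before handing the packet to a
 * raw socket. If the stack converts to network byte order on output
 * (stack_swaps != 0), the field is left in host order; otherwise the
 * caller converts it with htons().
 */
static uint16_t
demo_raw_ip_field(uint16_t host_value, int stack_swaps)
{
    assert(stack_swaps == 0 || stack_swaps == 1);
    return (stack_swaps ? host_value : htons(host_value));
}
```

With the behaviour described in this message, a 40-byte packet would be built with ip_len set to demo_raw_ip_field(40, 1), that is, plain 40.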

-- 
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using "1970s technology,"
 start asking why software is ignoring 30 years of accumulated wisdom.'

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-net" in the body of the message



Re: IP Fragmentation

2002-07-17 Thread Alfred Perlstein

* shubha mr <[EMAIL PROTECTED]> [020717 03:50] wrote:
> Hi,
> I am writing a gigabit ethernet driver for one of the
> NICs.My hardware is capable of computing the checksum
> and hence I am enabling per-packet handling of
> TCP/IP/UDP checksum offload in transmit side.I would
> like to know if there is a way by which I can tell the
> upperguy that I will not be able to compute the tcp
> checksum for the fragmented packets.That is I want to
> indicate that checksum offload can be offloaded only
> for the non fragmented and hence complete packets
> only.

From mbuf.h:

#define CSUM_IP         0x0001  /* will csum IP */
#define CSUM_TCP        0x0002  /* will csum TCP */
#define CSUM_UDP        0x0004  /* will csum UDP */
#define CSUM_IP_FRAGS   0x0008  /* will csum IP fragments */
#define CSUM_FRAGMENT   0x0010  /* will do IP fragmentation */

Just use the first 3, have a look at the if_bge.c driver for
an example.
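
A minimal userland sketch of "just use the first 3" follows. The flag values are copied from the list above; struct demo_ifnet, demo_attach() and demo_tx_offload() are hypothetical stand-ins for the driver's ifnet setup and transmit path, not code taken from if_bge.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CSUM_IP         0x0001  /* will csum IP */
#define CSUM_TCP        0x0002  /* will csum TCP */
#define CSUM_UDP        0x0004  /* will csum UDP */
#define CSUM_IP_FRAGS   0x0008  /* will csum IP fragments */

/* Stand-in for the part of struct ifnet that advertises offload support. */
struct demo_ifnet {
    uint32_t if_hwassist;
};

/* Advertise IP/TCP/UDP checksum offload but not fragment checksumming. */
static void
demo_attach(struct demo_ifnet *ifp)
{
    assert(ifp != NULL);
    ifp->if_hwassist = CSUM_IP | CSUM_TCP | CSUM_UDP;
}

/* Checksums the hardware is asked to compute for one outgoing packet. */
static uint32_t
demo_tx_offload(const struct demo_ifnet *ifp, uint32_t pkt_csum_flags)
{
    return (pkt_csum_flags & ifp->if_hwassist);
}
```

Because CSUM_IP_FRAGS is never advertised, the driver is not asked to checksum fragments; the assumption here is that the stack then handles those in software.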

-- 
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using "1970s technology,"
 start asking why software is ignoring 30 years of accumulated wisdom.'
Tax deductible donations for FreeBSD: http://www.freebsdfoundation.org/




Re: mbuf external buffer reference counters

2002-07-11 Thread Alfred Perlstein

* Julian Elischer <[EMAIL PROTECTED]> [020712 00:00] wrote:
> 
> 
> On Thu, 11 Jul 2002, Alfred Perlstein wrote:
> > 
> > That's true, but could someone explain how one can safely and
> > efficiently manipulate such a structure in an SMP environment?
> 
> what does NetBSD do for that?

They don't!

 *** waves skull staff exasperatedly ***

RORWLRLRLLRL



