Re: Network stack changes

2013-09-22 Thread Alexander V. Chernikov

On 29.08.2013 15:49, Adrian Chadd wrote:

Hi,

Hello Adrian!
I'm very sorry for the looong reply.



There's a lot of good stuff to review here, thanks!

Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to 
keep locking things like that on a per-packet basis. We should be able 
to do this in a cleaner way - we can defer RX into a CPU pinned 
taskqueue and convert the interrupt handler to a fast handler that 
just schedules that taskqueue. We can ignore the ithread entirely here.


What do you think?
Well, it sounds good :) But performance numbers and Jack's opinion are 
more important :)


Are you going to Malta?


Totally pie in the sky handwaving at this point:

* create an array of mbuf pointers for completed mbufs;
* populate the mbuf array;
* pass the array up to ether_demux().

For vlan handling, it may end up populating its own list of mbufs to 
push up to ether_demux(). So maybe we should extend the API to have a 
bitmap of packets to actually handle from the array, so we can pass up 
a larger array of mbufs, note which ones are for the destination and 
then the upcall can mark which frames it's consumed.
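
A minimal userland sketch of the array-plus-bitmap idea described above. Everything here is hypothetical: `struct pkt` stands in for an mbuf, and `pkt_batch`, `pkt_batch_add()`, `pkt_batch_dispatch()` and `consume_vlan()` are illustrative names, not existing kernel APIs. It only shows how a consumer could clear the bits of the frames it took so that the next layer sees what is left.

```c
#include <stddef.h>
#include <stdint.h>

#define PKT_BATCH_MAX 64        /* hypothetical: one bitmap bit per slot */

struct pkt {                    /* stand-in for struct mbuf */
    uint16_t vlan;
};

struct pkt_batch {
    struct pkt *pkts[PKT_BATCH_MAX];
    uint64_t    pending;        /* bit i set: pkts[i] still needs handling */
    unsigned    count;
};

/* Handler returns nonzero when it consumed the packet. */
typedef int (*pkt_handler_t)(struct pkt *p, void *arg);

/* Append a packet and mark it pending; returns -1 when the batch is full. */
static int
pkt_batch_add(struct pkt_batch *b, struct pkt *p)
{
    if (b->count >= PKT_BATCH_MAX)
        return (-1);
    b->pkts[b->count] = p;
    b->pending |= (uint64_t)1 << b->count;
    b->count++;
    return (0);
}

/*
 * Offer every still-pending packet to the handler and clear the bit of
 * each packet it consumed. Returns the number of packets consumed.
 */
static unsigned
pkt_batch_dispatch(struct pkt_batch *b, pkt_handler_t handler, void *arg)
{
    unsigned i, consumed = 0;

    for (i = 0; i < b->count; i++) {
        uint64_t bit = (uint64_t)1 << i;

        if ((b->pending & bit) == 0)
            continue;
        if (handler(b->pkts[i], arg) != 0) {
            b->pending &= ~bit;
            consumed++;
        }
    }
    return (consumed);
}

/* Example handler: consume only packets of one VLAN (arg is a uint16_t *). */
static int
consume_vlan(struct pkt *p, void *arg)
{
    return (p->vlan == *(uint16_t *)arg);
}
```

A vlan layer could call pkt_batch_dispatch() with consume_vlan() for its own tag and then hand the same batch, with the remaining bits still set, to the next consumer.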


I specifically wonder how much work/benefit we may see by doing:

* batching packets into lists so various steps can batch process 
things rather than run to completion;
* batching the processing of a list of frames under a single lock 
instance - eg, if the forwarding code could do the forwarding lookup 
for 'n' packets under a single lock, then pass that list of frames up 
to inet_pfil_hook() to do the work under one lock, etc, etc.
I'm thinking the same way, but we're stuck with 'forwarding lookup' due 
to the problem with the egress interface pointer, as I mentioned earlier. 
However, it would be interesting to see how much it helps, regardless of 
locking.


Currently I'm thinking that we should try to change radix to something 
different (it seems that this can be checked quickly) and see what happens.
Luigi's performance numbers for our radix are too awful, and there is a 
patch implementing an alternative trie:

http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff




Here, the processing would look less like "grab lock and process to 
completion" and more like "mark and sweep" - ie, we have a list of 
frames that we mark as needing processing and mark as having been 
processed at each layer, so we know where to next dispatch them.


I still have some tool coding to do with PMC before I even think about 
tinkering with this as I'd like to measure stuff like per-packet 
latency as well as top-level processing overhead (ie, 
CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC 
interrupts on that core, etc.)

That will be great to see!


Thanks,



-adrian



___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-22 Thread Alexander V. Chernikov

On 14.09.2013 22:49, Olivier Cochard-Labbé wrote:

On Sat, Sep 14, 2013 at 4:28 PM, Luigi Rizzo  wrote:

IXIA ? For the timescales we need to address we don't need an IXIA,
a netmap sender is more than enough


The great netmap generates only one IP flow (same src/dst IP and same
src/dst port).
This doesn't permit testing a multi-queue NIC (or SMP packet filter) in a
simple lab like this:
netmap sender => freebsd router => netmap receiver

I've got a variant which is capable of doing line-rate pcap replays on 
a single queue.

(However this is true for small pcaps only)


Regards,

Olivier




Re: Network stack changes

2013-09-22 Thread Alexander V. Chernikov

On 29.08.2013 02:24, Andre Oppermann wrote:

On 28.08.2013 20:30, Alexander V. Chernikov wrote:

Hello list!


Hello Alexander,

Hello Andre!
I'm very sorry to answer so late.


you sent quite a few things in the same email.  I'll try to respond
as much as I can right now.  Later you should split it up to have
more in-depth discussions on the individual parts.

If you could make it to the EuroBSDcon 2013 DevSummit that would be
even more awesome.  Most of the active network stack people will be
there too.
I've sent a presentation describing nearly the same things to devsummit@, 
so I hope this can be discussed in the Networking group.

I hope to attend DevSummit & EuroBSDcon.


There are a lot of constantly arising discussions related to networking 
stack performance/changes.


I'll try to summarize current problems and possible solutions from my 
point of view.
(Generally this is one problem: stack is slooow, but we need to know why 
and what to do).


Compared to others it's not thaaat slow. ;)


Let's start with current IPv4 packet flow on a typical router:
http://static.ipfw.ru/images/freebsd_ipv4_flow.png

(I'm sorry I can't provide this as text since Visio doesn't have any 
'ascii-art' exporter).


Note that we are using a process-to-completion model, i.e. we process any 
packet in the ISR until it is either consumed by the L4+ stack, dropped, 
or put on the egress NIC queue.

(There is also a deferred ISR model implemented inside netisr, but it 
does not change much: it can help to do more fine-grained hashing (for GRE 
or other similar traffic), but

1) it uses per-packet mutex locking which kills all performance
2) it currently does not have _any_ hashing functions (see absence of 
flags in `netstat -Q`)

People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or the 
modified PPPoE/GRE version) report some profit, but without fixing (1) it 
can't help much
)

So, let's start:

1) Ixgbe uses a mutex to protect each RX ring, which is perfectly fine 
since there is nearly no contention (the only thing that can happen is 
driver reconfiguration, which is rare, and, more significantly, we do this 
once for the batch of packets received in a given interrupt). However, due 
to some (im)possible deadlocks the current code does a per-packet ring 
unlock/lock (see ixgbe_rx_input()).
There was a discussion that ended with nothing:
http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html

1*) Possible BPF users. Here we have one rlock if there are any 
readers present (and a mutex for any matching packets, but this is more or 
less OK. Additionally, there is WIP to implement multiqueue BPF, and there 
is a chance that we can reduce lock contention there).


Rlock to rmlock?

Yes, probably.



There is also an "optimize_writers" hack permitting applications
like CDP to use BPF as writers but not registering them as receivers 
(which implies rlock)


I believe longer term we should solve this with a protocol type 
"ethernet" so that one can send/receive ethernet frames through a normal 
socket.

Yes. AF_LINK or any similar.


2/3) Virtual interfaces (laggs/vlans over lagg and other similar 
constructions).
Currently we simply use an rlock to make s/ix0/lagg0/ and, what is even 
funnier, we use a complex vlan_hash with another rlock to get the vlan 
interface from the underlying one.

This is definitely not how things should be done, and this can be 
changed more or less easily.


Indeed.

There are some useful terms/techniques in the world of software/hardware 
routing: they have a clear 'control plane' and 'data plane' separation.
The former deals with control traffic (IGP, MLD, IGMP snooping, lagg 
hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, with 
options, destined to hosts without an ARP/NDP record, and similar). The 
latter is done in hardware (or an efficient software implementation).
The control plane is responsible for providing data for efficient data 
plane operations. This is the point we are missing nearly everywhere.


ACK.

What I want to say is: lagg is pure control-plane stuff and vlan is 
nearly the same. We can't apply this approach to complex cases like 
lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
but we definitely can do this for most common setups like (igb* or 
ix* in lagg with or without vlans on top of lagg).


ACK.

We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can 
add some more. We even have per-driver hooks to program HW filtering.


We could.  Though for vlan it looks like it would be easier to remove the
hardware vlan tag stripping and insertion.  It only adds complexity in 
all drivers for no gain.

No. Actually, as far as I understand, it helps the driver to perform TSO. 
Anyway, IMO we should use HW capabilities if we can.
(This probably does not add much speed on 1G, but on 10/20/40G it can 
help much more).


One small step to do is to thro

Re: Network stack changes

2013-09-22 Thread Alexander V. Chernikov

On 29.08.2013 05:32, Slawa Olhovchenkov wrote:

On Thu, Aug 29, 2013 at 12:24:48AM +0200, Andre Oppermann wrote:


..
while Intel DPDK claims 80MPPS (and 6windgate talks about 160 or so) on
the same-class hardware and _userland_ forwarding.

Those numbers sound a bit far out.  Maybe if the packet isn't touched
or looked at at all in a pure netmap interface to interface bridging
scenario.  I don't believe these numbers.

80*64*8 = 40.960 Gb/s
Maybe DCA? And use a CPU with 40 PCIe lanes and 4 memory channels.
Intel introduces DDIO instead of DCA: 
http://www.intel.com/content/www/us/en/io/direct-data-i-o.html

(and it seems DCA does not help much):
https://www.myricom.com/software/myri10ge/790-how-do-i-enable-intel-direct-cache-access-dca-with-the-linux-myri10ge-driver.html
https://www.myricom.com/software/myri10ge/783-how-do-i-get-the-best-performance-with-my-myri-10g-network-adapters-on-a-host-that-supports-intel-data-direct-i-o-ddio.html

(However, the DPDK paper notes DDIO as one of the significant helpers)
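
For reference, the 80*64*8 arithmetic above can be checked with a few lines of C. This is only a sketch: the helper names are made up, and the 20 bytes of per-frame overhead (8-byte preamble plus 12-byte inter-frame gap) is the usual Ethernet figure, not something stated in this thread.

```c
/* Per-frame overhead on the wire: 8-byte preamble + 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20

/* Frame bits per second for a given packet rate, in Gb/s (no wire overhead). */
static double
payload_gbps(double mpps, unsigned frame_bytes)
{
    return (mpps * 1e6 * frame_bytes * 8.0 / 1e9);
}

/* Same, but counting preamble and inter-frame gap for every frame. */
static double
wire_gbps(double mpps, unsigned frame_bytes)
{
    return (mpps * 1e6 * (frame_bytes + ETH_WIRE_OVERHEAD) * 8.0 / 1e9);
}

/* Maximum packet rate of a link for a given frame size, in Mpps. */
static double
line_rate_mpps(double link_gbps, unsigned frame_bytes)
{
    return (link_gbps * 1e9 /
        ((frame_bytes + ETH_WIRE_OVERHEAD) * 8.0) / 1e6);
}
```

payload_gbps(80, 64) gives the 40.96 Gb/s quoted above; with wire overhead included, 80 Mpps of 64-byte frames needs about 53.76 Gb/s, while a single 10G port tops out at roughly 14.88 Mpps.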


Network stack changes

2013-08-28 Thread Alexander V. Chernikov

Hello list!

There are a lot of constantly arising discussions related to networking 
stack performance/changes.


I'll try to summarize current problems and possible solutions from my 
point of view.
(Generally this is one problem: stack is slooow, 
but we need to know why and what to do).


Let's start with current IPv4 packet flow on a typical router:
http://static.ipfw.ru/images/freebsd_ipv4_flow.png

(I'm sorry I can't provide this as text since Visio doesn't have any 
'ascii-art' exporter).


Note that we are using a process-to-completion model, i.e. we process any 
packet in the ISR until it is either consumed by the L4+ stack, dropped, 
or put on the egress NIC queue.

(There is also a deferred ISR model implemented inside netisr, but it 
does not change much: it can help to do more fine-grained hashing (for GRE 
or other similar traffic), but

1) it uses per-packet mutex locking which kills all performance
2) it currently does not have _any_ hashing functions (see absence of 
flags in `netstat -Q`)

People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or the 
modified PPPoE/GRE version) report some profit, but without fixing (1) it 
can't help much
)

So, let's start:

1) Ixgbe uses a mutex to protect each RX ring, which is perfectly fine 
since there is nearly no contention (the only thing that can happen is 
driver reconfiguration, which is rare, and, more significantly, we do this 
once for the batch of packets received in a given interrupt). However, due 
to some (im)possible deadlocks the current code does a per-packet ring 
unlock/lock (see ixgbe_rx_input()).
There was a discussion that ended with nothing: 
http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html


1*) Possible BPF users. Here we have one rlock if there are any readers 
present (and a mutex for any matching packets, but this is more or less OK. 
Additionally, there is WIP to implement multiqueue BPF, and there is a 
chance that we can reduce lock contention there). There is also an 
"optimize_writers" hack permitting applications like CDP to use BPF as 
writers without registering them as receivers (which implies rlock)


2/3) Virtual interfaces (laggs/vlans over lagg and other similar 
constructions).
Currently we simply use an rlock to make s/ix0/lagg0/ and, what is even 
funnier, we use a complex vlan_hash with another rlock to get the vlan 
interface from the underlying one.

This is definitely not how things should be done, and this can be 
changed more or less easily.


There are some useful terms/techniques in the world of software/hardware 
routing: they have a clear 'control plane' and 'data plane' separation.
The former deals with control traffic (IGP, MLD, IGMP snooping, lagg 
hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, with 
options, destined to hosts without an ARP/NDP record, and similar). The 
latter is done in hardware (or an efficient software implementation).
The control plane is responsible for providing data for efficient data 
plane operations. This is the point we are missing nearly everywhere.


What I want to say is: lagg is pure control-plane stuff and vlan is 
nearly the same. We can't apply this approach to complex cases like 
lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
but we definitely can do this for most common setups like (igb* or ix* 
in lagg with or without vlans on top of lagg).


We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can 
add some more. We even have per-driver hooks to program HW filtering.


One small step to do is to throw packet to vlan interface directly (P1), 
proof-of-concept(working in production):

http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html

Another is to change lagg packet accounting: 
http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
Again, this is more like HW boxes do (aggregate all counters including 
errors) (and I can't imagine what real error we can get from _lagg_).


4) If we are a router, we can do either the slooow ip_input() -> 
ip_forward() -> ip_output() cycle or use the optimized ip_fastfwd(), which 
falls back to the 'slow' path for multicast/options/local traffic (i.e. it 
works exactly like the 'data plane' part).
(Btw, we can consider turning net.inet.ip.fastforwarding on by 
default, at least for non-IPSEC kernels)


Here we have to determine if this is a local packet or not, i.e. F(dst_ip) 
returning 1 or 0. Currently we are simply using a standard rlock + hash of 
iface addresses.

(And some consumers like ipfw(4) do the same, but without a lock).
We don't need to do this! We can build a sorted array of IPv4 addresses or 
another efficient structure on every address change and use it unlocked 
with delayed garbage collection (proof-of-concept attached).
(There is another thing to discuss: maybe we can do this once somewhere 
in ip_input and mark the mbuf as 'local/non-local'?)
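
A userland sketch of the sorted-array variant of F(dst_ip), to make the idea concrete. This is not the attached proof-of-concept: `struct local_addrs`, `local_addrs_build()` and `local_addrs_match()` are hypothetical names, and the atomic pointer swap plus delayed garbage collection of the old snapshot are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Immutable snapshot of local IPv4 addresses (host byte order, sorted). */
struct local_addrs {
    size_t   count;
    uint32_t addr[];
};

static int
cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

    return ((x > y) - (x < y));
}

/*
 * Control plane: build a fresh sorted snapshot on every address change.
 * A real implementation would publish the new pointer atomically and
 * free the old snapshot only after all readers are done with it.
 */
static struct local_addrs *
local_addrs_build(const uint32_t *src, size_t n)
{
    struct local_addrs *la;

    la = malloc(sizeof(*la) + n * sizeof(la->addr[0]));
    if (la == NULL)
        return (NULL);
    la->count = n;
    if (n > 0)
        memcpy(la->addr, src, n * sizeof(la->addr[0]));
    qsort(la->addr, n, sizeof(la->addr[0]), cmp_u32);
    return (la);
}

/* Data plane: F(dst_ip), read-only binary search, no lock taken. */
static int
local_addrs_match(const struct local_addrs *la, uint32_t dst)
{
    size_t lo = 0, hi = la->count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (la->addr[mid] == dst)
            return (1);
        if (la->addr[mid] < dst)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (0);
}
```

Because a snapshot is never modified after it is built, readers need no lock; the cost moves to the (rare) address change, which rebuilds the whole array.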


5, 9) Currently we have L3 ingress/egress PFIL hooks protected by 
rmlocks. This is OK.


However, 6) and 7) are not.
Firewall can u

VLANHWFILTER "upgrade"

2013-04-15 Thread Alexander V. Chernikov
Hello list.

We currently have VLANHWFILTER functionality allowing underlying
physical/virtual interfaces to be aware of vlans stacked on them.

However, this knowledge is only used to program NIC hw filter (or to
broadcast to member ifaces in lagg case).

Proposed idea is to save vlan ifp pointer inside the driver and push
packet to given vlan directly.

This change removes 1 read lock on the RX fast path.
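
The idea can be sketched in plain userland C as a per-ring table indexed by VLAN ID. This is a simplified model and not the patch itself: `struct ifnet_stub`, `struct vlan_index` and `rx_pick_ifp()` are hypothetical stand-ins, and the registration side (filling the table from the vlan config event) is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define VLAN_ID_MASK  0x0FFF    /* low 12 bits of the 802.1Q tag: VLAN ID */
#define VLAN_ID_COUNT 4096

struct ifnet_stub {             /* stand-in for struct ifnet */
    const char    *name;
    unsigned long  ipackets;
};

/* One table per RX ring, filled in when a vlan is configured. */
struct vlan_index {
    struct ifnet_stub *idx[VLAN_ID_COUNT];
};

/*
 * Pick the interface a received frame should be handed to. A known
 * VLAN ID maps straight to its vlan interface with a single array
 * load; anything else goes to the parent and the generic path.
 */
static struct ifnet_stub *
rx_pick_ifp(struct vlan_index *vi, struct ifnet_stub *parent, uint16_t vtag)
{
    struct ifnet_stub *dst;

    if (vtag == 0)
        return (parent);
    dst = vi->idx[vtag & VLAN_ID_MASK];
    if (dst == NULL)
        return (parent);
    dst->ipackets++;
    return (dst);
}
```

The table costs one pointer per possible VLAN ID per ring, which is the price paid for replacing the locked hash lookup with an array index.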

Additionally, we can do the same in more popular case of

ix -> lagg [ -> lagg -> lagg ] -> vlan

if we solve
1) lagg interface counters issue (trivial)
2) IFF_MONITOR on lagg interface issue (not so trivial, unfortunately).

Patch to ixgbe driver attached (maybe it is better to put
ixgbe_vlan_get() and struct ifvlans directly to if_vlan.[ch]).


-- 
WBR, Alexander
Index: sys/dev/ixgbe/ixgbe.c
===
--- sys/dev/ixgbe/ixgbe.c   (revision 248704)
+++ sys/dev/ixgbe/ixgbe.c   (working copy)
@@ -2880,6 +2880,14 @@ ixgbe_allocate_queues(struct adapter *adapter)
error = ENOMEM;
goto err_rx_desc;
}
+
+   if ((rxr->vlans = malloc(sizeof(struct ifvlans), M_DEVBUF,
+   M_NOWAIT | M_ZERO)) == NULL) {
+   device_printf(dev,
+   "Critical Failure setting up vlan index\n");
+   error = ENOMEM;
+   goto err_rx_desc;
+   }
}
 
/*
@@ -4271,6 +4279,11 @@ ixgbe_free_receive_buffers(struct rx_ring *rxr)
rxr->ptag = NULL;
}
 
+   if (rxr->vlans != NULL) {
+   free(rxr->vlans, M_DEVBUF);
+   rxr->vlans = NULL;
+   }
+
return;
 }
 
@@ -4303,7 +4316,7 @@ ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *
 return;
 }
IXGBE_RX_UNLOCK(rxr);
-(*ifp->if_input)(ifp, m);
+(*ifp->if_input)(m->m_pkthdr.rcvif, m);
IXGBE_RX_LOCK(rxr);
 }
 
@@ -4360,6 +4373,7 @@ ixgbe_rxeof(struct ix_queue *que)
u16 count = rxr->process_limit;
union ixgbe_adv_rx_desc *cur;
struct ixgbe_rx_buf *rbuf, *nbuf;
+   struct ifnet*ifp_dst;
 
IXGBE_RX_LOCK(rxr);
 
@@ -4522,9 +4536,19 @@ ixgbe_rxeof(struct ix_queue *que)
(staterr & IXGBE_RXD_STAT_VP))
vtag = le16toh(cur->wb.upper.vlan);
if (vtag) {
-   sendmp->m_pkthdr.ether_vtag = vtag;
-   sendmp->m_flags |= M_VLANTAG;
-   }
+   ifp_dst = rxr->vlans->idx[EVL_VLANOFTAG(vtag)];
+
+   if (ifp_dst != NULL) {
+   ifp_dst->if_ipackets++;
+   sendmp->m_pkthdr.rcvif = ifp_dst;
+   } else {
+   sendmp->m_pkthdr.ether_vtag = vtag;
+   sendmp->m_flags |= M_VLANTAG;
+   sendmp->m_pkthdr.rcvif = ifp;
+   }
+   } else
+   sendmp->m_pkthdr.rcvif = ifp;
+
if ((ifp->if_capenable & IFCAP_RXCSUM) != 0)
ixgbe_rx_checksum(staterr, sendmp, ptype);
 #if __FreeBSD_version >= 80
@@ -4625,7 +4649,32 @@ ixgbe_rx_checksum(u32 staterr, struct mbuf * mp, u
return;
 }
 
+/*
+ * This routine gets real vlan ifp based on
+ * underlying ifp and vlan tag.
+ */
+static struct ifnet *
+ixgbe_get_vlan(struct ifnet *ifp, uint16_t vtag)
+{
 
+   /* XXX: IFF_MONITOR */
+#if 0
+   struct lagg_port *lp = ifp->if_lagg;
+   struct lagg_softc *sc = lp->lp_softc;
+
+   /* Skip lagg nesting */
+   while (ifp->if_type == IFT_IEEE8023ADLAG) {
+   lp = ifp->if_lagg;
+   sc = lp->lp_softc;
+   ifp = sc->sc_ifp;
+   }
+#endif
+   /* Get vlan interface based on tag */
+   ifp = VLAN_DEVAT(ifp, vtag);
+
+   return (ifp);
+}
+
 /*
 ** This routine is run via an vlan config EVENT,
 ** it enables us to use the HW Filter table since
@@ -4637,7 +4686,9 @@ static void
 ixgbe_register_vlan(void *arg, struct ifnet *ifp, u16 vtag)
 {
struct adapter  *adapter = ifp->if_softc;
-   u16 index, bit;
+   u16 index, bit, j;
+   struct rx_ring  *rxr;
+   struct ifnet*ifv;
 
if (ifp->if_softc !=  arg)   /* Not our event */
return;
@@ -4645,7 +4696,20 @@ ixgbe_register_vlan(void *arg, struct ifnet *ifp,
if ((vtag == 0) || (vtag > 4095))   /* Invalid */
return;
 
+   ifv = ixgbe_get_vlan(ifp, vtag);
+
IXGBE_CORE_LOCK(adapter);
+
+   if (ifp->if_capenable & IFCAP_VLAN_HWFILT

Make kernel aware of NIC queues

2013-02-06 Thread Alexander V. Chernikov

Hello list!

Today more and more NICs are capable of splitting traffic to different 
RX/TX rings, permitting the OS to dispatch this traffic on different CPU 
cores. However, there are some problems that arise from using multi-NIC 
(or even single multi-port NIC) configurations:


Typical (OS) questions are:
* how many queues should we allocate per port?
* how should we mark packets received in a given queue?
* what traffic pattern is the NIC used for: should we bind queues to CPU 
cores and, if so, to which ones?


Currently, there is some AI implemented in Intel drivers, like:
* use the maximum available queues if the CPU has a large number of cores
* bind every queue to a CPU core sequentially.

Problems with this (and probably any) AI are:
* what NICs (ports) will be _actually_ used?
E.g:
I have an 8-core system with dual 82576 Intel NIC (which is capable of 
using 8 RX queues per port).
If only one port is used, I can allocate 8 (or 7) queues and bind them to 
given cores, which is generally good for forwarding traffic.
For 2-port setups it is probably better to set up 4 queues per port 
to make sure ithreads from different cards do not interfere with each other.


* How exactly we should mark packets?
There are traffic flows which are not hashed properly by NIC (mostly 
non-IP/IPv6 traffic, PPPoE, various tunnels are good examples) so driver 
receives all such packets on q0 and marks them with FLOWID 0, which can 
be unhandy in some situations. It can be better if we can instruct NIC 
not to mark such packets with any id permitting OS to re-calculate hash 
via probably more powerful netisr hash function.


* Traffic flow inside OS / flowid marking
Smarter flowid marking may be needed in some cases:
for example, if we are using lagg with 2 NICs for traffic forwarding,
this results in increased contention on transmit parts:
From the previous example:
port 0 has q0-q3 bound to cores 0-3
port 1 has q0-q3 bound to cores 4-7

flow ids are the same as core numbers.

lagg uses (flowid % number_nics) which leads to TX contention:
0 (0 % 2)=port0, (0 % 4)=queue0
1 (1 % 2)=port1, (1 % 4)=queue1
2 (2 % 2)=port0, (2 % 4)=queue2
3 (3 % 2)=port1, (3 % 4)=queue3
4 (4 % 2)=port0, (4 % 4)=queue0
5 (5 % 2)=port1, (5 % 4)=queue1
6 (6 % 2)=port0, (6 % 4)=queue2
7 (7 % 2)=port1, (7 % 4)=queue3

Flow IDs 0 and 4, 1 and 5, 2 and 6, 3 and 7 use the same TX queues on 
the same egress NICs.


This can be minimized by using GCD(queues, ports)=1 configurations 
(3 queues should do the trick in this case), but this leads to 
suboptimal CPU usage.
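
The table above and the GCD remark can be reproduced with a small sketch. It assumes, as in the example, that lagg picks the port as flowid % ports and the driver picks the TX queue as flowid % queues; the function names are made up for illustration.

```c
struct tx_slot {
    unsigned port;
    unsigned queue;
};

/* Egress (port, queue) pair chosen for a flow ID under the modulo scheme. */
static struct tx_slot
tx_slot_for(unsigned flowid, unsigned nports, unsigned nqueues)
{
    struct tx_slot s = { flowid % nports, flowid % nqueues };

    return (s);
}

/* Count the distinct (port, queue) pairs hit by flow IDs 0..nflows-1. */
static unsigned
distinct_tx_slots(unsigned nflows, unsigned nports, unsigned nqueues)
{
    unsigned f, g, distinct = 0;

    for (f = 0; f < nflows; f++) {
        struct tx_slot a = tx_slot_for(f, nports, nqueues);
        int seen = 0;

        /* Has an earlier flow ID already mapped to the same pair? */
        for (g = 0; g < f; g++) {
            struct tx_slot b = tx_slot_for(g, nports, nqueues);

            if (a.port == b.port && a.queue == b.queue) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return (distinct);
}
```

For 8 flow IDs, 2 ports and 4 queues per port this reports only 4 distinct TX slots out of the 8 available; with 3 queues per port it reports 6 out of 6, at the cost of leaving cores without a queue.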


We internally use a patched igb/ix driver which permits setting flow ids 
manually (and I have heard other people are using hacks to enable/disable 
setting M_FLOWID).


I propose implementing a common API to permit drivers to:
* read user-supplied number of queues/other queue options (e.g:
* notify the kernel of each RX/TX queue being created/destroyed
* bind queues to cores via the given API
* export data to userland (for example, via sysctl) to permit users to:
a) quickly see the current configuration
b) change CPU binding on the fly
c) change flowid numbers on the fly (with the possibility to 1) use the 
NIC-supplied hash, 2) use a manually supplied value, or 3) disable 
setting M_FLOWID)


Having a common interface will make network stack tuning easier for users 
and puts us one step closer to making a (probably userland) AI which 
can auto-tune the system according to a template ("router", "webserver") 
and rc.conf configuration (lagg presence, etc.)



What do you guys think?




Re: [patch] reducing arp locking

2012-11-09 Thread Alexander V. Chernikov

On 09.11.2012 13:59, Fabien Thomas wrote:


Le 9 nov. 2012 à 10:05, Alexander V. Chernikov a écrit :


On 09.11.2012 12:51, Fabien Thomas wrote:


Le 8 nov. 2012 à 11:25, Alexander V. Chernikov a écrit :


On 08.11.2012 14:24, Andre Oppermann wrote:

On 08.11.2012 00:24, Alexander V. Chernikov wrote:

Hello list!

Currently we need to acquire 2 read locks to perform simple 6-byte
copying from arp record to packet
ethernet header.

It seems that acquiring lle lock for fast path (main traffic flow) is
not necessary even with
current code.

My tests show ~10% improvement with this patch applied.

If nobody objects I plan to commit this change at the end of next week.


This is risky and prone to race conditions.  The copy of the MAC address
should be done while the table read lock is held to protect against the

It is done exactly as you say: table read lock is held.


How do you protect from entry update if i've a ref to the entry ?
You can end up doing bcopy of a partial mac address.

I see no problems in copying an incorrect mac address in that case:
if the host mac address is updated, this is, most likely, another host, 
and several packets being lost changes nothing.


Sending packet to a bogus mac address is not really nothing :)



However, there can be some realistic scenario where this can be the case (L2 
load balancing/failover). I'll update in_arpinput() to do lle removal/insertion 
in that case.


la_preempt modification is also write access to an unlocked structure.

This one changes nothing:
current code does this under _read_ lock.


Under the table lock not the entry lock ?

lle entry is read-locked while la_preempt is modified.


Table lock is here to protect the table if I've understood the code correctly.

Yes.

If i get an exclusive reference to the entry you will end up reading and 
writing to the entry without any lock.
Yes. And the only drawback in the worst case can be sending a few 
more packets to the right (but probably expired) MAC address.


I'm talking about the following:
the ARP stack is just an IP -> 6 bytes mapping, there is no reason to make 
it unnecessarily complicated like rte, with references being held by the 
upper layer stack. It does not contain an interface pointer, etc.


We may need to r/w lock entry, but for 'control plane' code only.
If one acquired exclusive lock and wants to change STATIC flag to 
non-static or change lle address - this is simply wrong and has to be 
handled by acquiring table wlock.


Current ARP code has some flaws like handling arp expiration, but this 
patch doesn't change much here.

entry going away.  You can either return with table lock held and drop
it after the copy, or you could use a modified lookup function that takes a
pointer for the copy destination, do the copy with the read lock, and then
return.  If no entry is found an error is returned and obviously no copy
is done.




--
WBR, Alexander


--
WBR, Alexander




Re: [patch] reducing arp locking

2012-11-09 Thread Alexander V. Chernikov

On 09.11.2012 12:51, Fabien Thomas wrote:


Le 8 nov. 2012 à 11:25, Alexander V. Chernikov a écrit :


On 08.11.2012 14:24, Andre Oppermann wrote:

On 08.11.2012 00:24, Alexander V. Chernikov wrote:

Hello list!

Currently we need to acquire 2 read locks to perform simple 6-byte
copying from arp record to packet
ethernet header.

It seems that acquiring lle lock for fast path (main traffic flow) is
not necessary even with
current code.

My tests show ~10% improvement with this patch applied.

If nobody objects I plan to commit this change at the end of next week.


This is risky and prone to race conditions.  The copy of the MAC address
should be done while the table read lock is held to protect against the

It is done exactly as you say: table read lock is held.


How do you protect from entry update if i've a ref to the entry ?
You can end up doing bcopy of a partial mac address.

I see no problems in copying an incorrect mac address in that case:
if the host mac address is updated, this is, most likely, another host, 
and several packets being lost changes nothing.


However, there can be some realistic scenario where this can be the case 
(L2 load balancing/failover). I'll update in_arpinput() to do lle 
removal/insertion in that case.



la_preempt modification is also write access to an unlocked structure.

This one changes nothing:
current code does this under _read_ lock.

entry going away.  You can either return with table lock held and drop
it after the copy, or you could use a modified lookup function that takes a
pointer for the copy destination, do the copy with the read lock, and then
return.  If no entry is found an error is returned and obviously no copy
is done.




--
WBR, Alexander




Re: [patch] reducing arp locking

2012-11-08 Thread Alexander V. Chernikov

On 08.11.2012 03:46, Adrian Chadd wrote:

On 7 November 2012 15:24, Alexander V. Chernikov  wrote:

Hello list!

Currently we need to acquire 2 read locks to perform simple 6-byte copying
from arp record to packet ethernet header.

It seems that acquiring lle lock for fast path (main traffic flow) is not
necessary even with current code.

My tests show ~10% improvement with this patch applied.

If nobody objects I plan to commit this change at the end of next week.


That's a great catch! How'd you discover it?
We have lots of FreeBSD routers doing 10G firewalling, so we're very 
much concerned with forwarding/firewalling performance, constantly 
looking for something to optimize :)




Adrian





Re: [patch] reducing arp locking

2012-11-08 Thread Alexander V. Chernikov

On 08.11.2012 14:24, Andre Oppermann wrote:

On 08.11.2012 00:24, Alexander V. Chernikov wrote:

Hello list!

Currently we need to acquire 2 read locks to perform simple 6-byte
copying from arp record to packet
ethernet header.

It seems that acquiring lle lock for fast path (main traffic flow) is
not necessary even with
current code.

My tests show ~10% improvement with this patch applied.

If nobody objects I plan to commit this change at the end of next week.


This is risky and prone to race conditions.  The copy of the MAC address
should be done while the table read lock is held to protect against the

It is done exactly as you say: table read lock is held.


entry going away.  You can either return with table lock held and drop
it after the copy, or you could use a modified lookup function that takes a
pointer for the copy destination, do the copy with the read lock, and then
return.  If no entry is found an error is returned and obviously no copy
is done.




--
WBR, Alexander




[patch] reducing arp locking

2012-11-07 Thread Alexander V. Chernikov

Hello list!

Currently we need to acquire 2 read locks to perform simple 6-byte 
copying from arp record to packet ethernet header.


It seems that acquiring lle lock for fast path (main traffic flow) is 
not necessary even with current code.


My tests show ~10% improvement with this patch applied.

If nobody objects I plan to commit this change at the end of next week.

Index: sys/netinet/in.c
===
--- sys/netinet/in.c(revision 242524)
+++ sys/netinet/in.c(working copy)
@@ -1476,7 +1476,7 @@ in_lltable_lookup(struct lltable *llt, u_int flags
if (LLE_IS_VALID(lle)) {
if (flags & LLE_EXCLUSIVE)
LLE_WLOCK(lle);
-   else
+   else if (!(flags & LLE_UNLOCKED))
LLE_RLOCK(lle);
}
 done:
Index: sys/netinet/if_ether.c
===
--- sys/netinet/if_ether.c  (revision 242524)
+++ sys/netinet/if_ether.c  (working copy)
@@ -293,10 +293,10 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0,
struct sockaddr *dst, u_char *desten, struct llentry **lle)
 {
struct llentry *la = 0;
-   u_int flags = 0;
+   u_int flags = LLE_UNLOCKED;
struct mbuf *curr = NULL;
struct mbuf *next = NULL;
-   int error, renew;
+   int error, renew = 0;
 
*lle = NULL;
if (m != NULL) {
@@ -315,7 +315,41 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0,
 retry:
IF_AFDATA_RLOCK(ifp);
la = lla_lookup(LLTABLE(ifp), flags, dst);
+
+   /*
+* Fast path. Do not require rlock on llentry.
+*/
+   if ((la != NULL) && (flags & LLE_UNLOCKED)) {
+   if ((la->la_flags & LLE_VALID) &&
+   ((la->la_flags & LLE_STATIC) || la->la_expire > time_uptime)) {
+   bcopy(&la->ll_addr, desten, ifp->if_addrlen);
+   /*
+* If entry has an expiry time and it is approaching,
+* see if we need to send an ARP request within this
+* arpt_down interval.
+*/
+   if (!(la->la_flags & LLE_STATIC) &&
+   time_uptime + la->la_preempt > la->la_expire) {
+   renew = 1;
+   la->la_preempt--;
+   }
+
+   IF_AFDATA_RUNLOCK(ifp);
+   if (renew != 0)
+   arprequest(ifp, NULL, &SIN(dst)->sin_addr, NULL);
+
+   return (0);
+   }
+
+   /* Revert to normal path for other cases */
+   flags &= ~LLE_UNLOCKED;
+   *lle = la;
+   LLE_RLOCK(la);
+   }
+
+
IF_AFDATA_RUNLOCK(ifp);
+
if ((la == NULL) && ((flags & LLE_EXCLUSIVE) == 0)
&& ((ifp->if_flags & (IFF_NOARP | IFF_STATICARP)) == 0)) {
flags |= (LLE_CREATE | LLE_EXCLUSIVE);
@@ -332,25 +366,6 @@ retry:
return (EINVAL);
}
 
-   if ((la->la_flags & LLE_VALID) &&
-   ((la->la_flags & LLE_STATIC) || la->la_expire > time_uptime)) {
-   bcopy(&la->ll_addr, desten, ifp->if_addrlen);
-   /*
-* If entry has an expiry time and it is approaching,
-* see if we need to send an ARP request within this
-* arpt_down interval.
-*/
-   if (!(la->la_flags & LLE_STATIC) &&
-   time_uptime + la->la_preempt > la->la_expire) {
-   arprequest(ifp, NULL, &SIN(dst)->sin_addr, NULL);
-   la->la_preempt--;
-   }
-
-   *lle = la;
-   error = 0;
-   goto done;
-   }
-
if (la->la_flags & LLE_STATIC) {   /* should not happen! */
log(LOG_DEBUG, "arpresolve: ouch, empty static llinfo for %s\n",
inet_ntoa(SIN(dst)->sin_addr));
Index: sys/net/if_llatbl.h
===
--- sys/net/if_llatbl.h (revision 242524)
+++ sys/net/if_llatbl.h (working copy)
@@ -178,6 +178,7 @@ MALLOC_DECLARE(M_LLTABLE);
 #define LLE_EXCLUSIVE   0x2000  /* return lle xlocked  */
 #define LLE_DELETE      0x4000  /* delete on a lookup - match LLE_IFADDR */
 #define LLE_CREATE      0x8000  /* create on a lookup miss */
+#define LLE_UNLOCKED    0x1 /* return lle unlocked */
 
 #define LLATBL_HASH(key, mask) \
(((((((key >> 8) ^ key) >> 8) ^ key) >> 8) ^ key) & mask)

Re: FreeBSD 10G forwarding performance @Intel

2012-07-03 Thread Alexander V. Chernikov

On 04.07.2012 00:27, Luigi Rizzo wrote:

On Tue, Jul 03, 2012 at 09:37:38PM +0400, Alexander V. Chernikov wrote:
...

Thanks, another good point. I forgot to merge this option from andre's
patch.

Another 30-40-50kpps to win.


not much gain though.
What about the other IPSTAT_INC counters ?
Well, we should then remove all such counters (total, forwarded) and 
per-interface statistics (at least for forwarded packets).

I think the IPSTAT_INC macros were introduced (by rwatson ?)
following a discussion on how to make the counters per-cpu
and avoid the contention on cache lines.
But they are still implemented as a single instance,
and neither volatile nor atomic, so it is not even clear
that they can give reliable results, let alone the fact
that you are likely to get some cache misses.

the relevant macro is in ip_var.h.

Hm. This seems to be just a per-vnet structure instance.
We've got some real DPCPU stuff (sys/pcpu.h && kern/subr_pcpu.c)
which can be used for the global ipstat structure; however, since it is
allocated from a single area with no possibility to free, we can't use it
for per-interface counters.
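
To illustrate what per-CPU counters buy us, here is a minimal userland sketch, not the kernel DPCPU API. It assumes a 64-byte cache line and a fixed number of CPU slots; the names padded_counter, percpu_stat, stat_inc() and stat_fetch() are hypothetical:

```c
#include <stdint.h>

#define CACHE_LINE 64	/* assumed cache line size */
#define NSLOTS     4	/* assumed number of CPUs */

/* One counter per CPU, aligned (and therefore padded) to its own cache line. */
struct padded_counter {
	_Alignas(CACHE_LINE) uint64_t value;
};

struct percpu_stat {
	struct padded_counter slot[NSLOTS];
};

/* Fast path: each CPU only ever writes its own slot, so no line bouncing. */
static void
stat_inc(struct percpu_stat *s, unsigned int cpu)
{
	s->slot[cpu % NSLOTS].value++;
}

/* Slow path: the reader sums all slots; the result may be slightly stale. */
static uint64_t
stat_fetch(const struct percpu_stat *s)
{
	uint64_t sum = 0;

	for (unsigned int i = 0; i < NSLOTS; i++)
		sum += s->slot[i].value;
	return (sum);
}
```

The price is memory: every counter costs one cache line per CPU, which is part of why per-interface counters are harder to do this way than a single global structure.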


I'll try to run tests without any possibly contested counters and report 
the results on Thursday.


Cheers
luigi



+u_int rt_count  = 1;
+SYSCTL_INT(_net, OID_AUTO, rt_count, CTLFLAG_RW,&rt_count, 1, "");

@@ -601,17 +625,20 @@ passout:
 if (error != 0)
 IPSTAT_INC(ips_odropped);
 else {
-   ro.ro_rt->rt_rmx.rmx_pksent++;
+   if (rt_count)
+   ro.ro_rt->rt_rmx.rmx_pksent++;
 IPSTAT_INC(ips_forward);
 IPSTAT_INC(ips_fastforward);




cheers
luigi




--
WBR, Alexander
___
freebsd-...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"





--
WBR, Alexander


Re: FreeBSD 10G forwarding performance @Intel

2012-07-03 Thread Alexander V. Chernikov

On 03.07.2012 20:55, Luigi Rizzo wrote:

On Tue, Jul 03, 2012 at 08:11:14PM +0400, Alexander V. Chernikov wrote:

Hello list!

I'm quite stuck with bad forwarding performance on many FreeBSD boxes
doing firewalling.

...

In most cases the system can forward no more than 700 (or 1400) kpps, which
is quite a bad number (Linux does, say, 5 Mpps on nearly the same hardware).


among the many interesting tests you have run, i am curious
if you have tried to remove the update of the counters on route
entries. They might be another severe contention point.


21:47 [0] m@test15 netstat -I ix0 -w 1
            input          (ix0)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
   1785514 52785      0  121318340    1784650     0  117874854     0
   1773126 52437      0  120701470    1772977     0  117584736     0
   1781948 52154      0  121060126    1778271     0   75029554     0
   1786169 52982      0  121451160    1787312     0  160967392     0
21:47 [0] test15# sysctl net.rt_count=0
net.rt_count: 1 -> 0
   1814465 22546      0  121302076    1814291     0   76860092     0
   1817769 14272      0  120984922    1816254     0  163643534     0
   1815311 13113      0  120831970    1815340     0  120159118     0
   1814059 13698      0  120799132    1813738     0  120172092     0
   1818030 13513      0  120960140    1814578     0  120332662     0
   1814169 14351      0  120836182    1814003     0  120164310     0

Thanks, another good point. I forgot to merge this option from andre's 
patch.


Another 30-40-50kpps to win.


+u_int rt_count  = 1;
+SYSCTL_INT(_net, OID_AUTO, rt_count, CTLFLAG_RW, &rt_count, 1, "");

@@ -601,17 +625,20 @@ passout:
if (error != 0)
IPSTAT_INC(ips_odropped);
else {
-   ro.ro_rt->rt_rmx.rmx_pksent++;
+   if (rt_count)
+   ro.ro_rt->rt_rmx.rmx_pksent++;
IPSTAT_INC(ips_forward);
IPSTAT_INC(ips_fastforward);




cheers
luigi




--
WBR, Alexander


FreeBSD 10G forwarding performance @Intel

2012-07-03 Thread Alexander V. Chernikov

Hello list!

I'm quite stuck with bad forwarding performance on many FreeBSD boxes 
doing firewalling.


Typical configuration is E5645 / E5675 @ Intel 82599 NIC.
HT is turned off.
(Configs and tunables below).

I'm mostly concerned with unidirectional traffic flowing to a single
interface (e.g. using a single route entry).


In most cases the system can forward no more than 700 (or 1400) kpps, which
is quite a bad number (Linux does, say, 5 Mpps on nearly the same hardware).



Test scenario:

Ixia XM2 (traffic generator) <> ix0 (FreeBSD).

Ixia sends 64byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to
destinations in vlan11 (10.100.1.128 - 10.100.1.192).

Static arps are configured for all destination addresses.

Traffic level is slightly above or slightly below system performance.


= Test 1  ===
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, 
no firewall


Traffic: 1-1 flow (1 src, 1 dst)
(This is actually a bit different from described above)

Result:
            input          (ix0)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      878k   48k      0        59M       878k     0        56M     0
      874k   48k      0        59M       874k     0        56M     0
      875k   48k      0        59M       875k     0        56M     0

16:41 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf "  %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'

    STATE  C   TIME        CPU  COMMAND
     CPU6  6  17:28    100.00%  kernel{ix0 que}
     CPU9  9  20:42     60.06%  intr{irq265: ix0:que

16:41 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0        500796    167
irq257: ix0:que 1       6693573   2245
irq258: ix0:que 2       2572380    862
irq259: ix0:que 3       3166273   1062
irq260: ix0:que 4       9691706   3251
irq261: ix0:que 5      10766434   3611
irq262: ix0:que 6       8933774   2996
irq263: ix0:que 7       5246879   1760
irq264: ix0:que 8       3548930   1190
irq265: ix0:que 9      11817986   3964
irq266: ix0:que 10       227561     76
irq267: ix0:link              1      0

Note that the system is using 2 cores to forward, so 12 cores should be able
to forward 4+ Mpps, which is more or less consistent with the Linux results.
Note that interrupts on all queues are (as far as I understand from the
fact that AIM is turned off and interrupt rates are the same from
previous test). Additionally, despite hw.intr_storm_threshold = 200k,
I'm constantly getting the following message:

interrupt storm detected on "irq265:"; throttling interrupt source
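
The extrapolation above can be written down as a small helper. This is only a sketch under the assumption of perfectly linear scaling with core count, which real forwarding paths rarely achieve; scale_kpps() is a hypothetical name:

```c
/*
 * Linear extrapolation of forwarding rate: if 'cores_used' cores forward
 * 'measured_kpps', estimate what 'cores_total' cores could do.
 * Assumes perfect scaling (no lock or cache-line contention), so the
 * result is an upper bound rather than a prediction.
 */
static double
scale_kpps(double measured_kpps, int cores_used, int cores_total)
{
	return (measured_kpps * cores_total / cores_used);
}
```

With the Test 1 figures (about 878 kpps on 2 cores), scale_kpps(878.0, 2, 12) gives 5268 kpps, i.e. roughly 5.3 Mpps as a ceiling; the more conservative "4+ Mpps" leaves room for scaling losses.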


= Test 2  ===
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, 
no firewall


Traffic: Unidirectional many-2-many

16:20 [0] test15# netstat -I ix0 -hw 1
            input          (ix0)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      507k  651k      0        74M       508k     0        32M     0
      506k  652k      0        74M       507k     0        28M     0
      509k  652k      0        74M       508k     0        37M     0


16:28 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf "  %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'

    STATE  C   TIME        CPU  COMMAND
    CPU10  6   0:40    100.00%  kernel{ix0 que}
     CPU2  2  11:47     84.86%  intr{irq258: ix0:que
     CPU3  3  11:50     81.88%  intr{irq259: ix0:que
     CPU8  8  11:38     77.69%  intr{irq264: ix0:que
     CPU7  7  11:24     77.10%  intr{irq263: ix0:que
     WAIT  1  10:10     74.76%  intr{irq257: ix0:que
     CPU4  4   8:57     63.48%  intr{irq260: ix0:que
     CPU6  6   8:35     61.96%  intr{irq262: ix0:que
     CPU9  9  14:01     60.79%  intr{irq265: ix0:que
      RUN  0   9:07     59.67%  intr{irq256: ix0:que
     WAIT  5   6:13     43.26%  intr{irq261: ix0:que
    CPU11 11   5:19     35.89%  kernel{ix0 que}
        -  4   3:41     25.49%  kernel{ix0 que}
        -  1   3:22     21.78%  kernel{ix0 que}
        -  1   2:55     17.68%  kernel{ix0 que}
        -  4   2:24     16.55%  kernel{ix0 que}
        -  1   9:54     14.99%  kernel{ix0 que}
     CPU0 11   2:13     14.26%  kernel{ix0 que}


16:07 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0         13654     15
irq257: ix0:que 1         87043     96
irq258: ix0:que 2         39604     44
irq259: ix0:que 3         48308     53
irq260: ix0:que 4        138002    153
irq261: ix0:que 5        169596    188
irq262: ix0:que 6        107679    119
irq263: ix0:que 7         72769     81
irq264: ix0:que 8         30878     34
irq265: ix0:que 

Re: ifconfig accepting hostname as ipv4 address

2012-06-08 Thread Alexander V. Chernikov

On 08.06.2012 11:20, Jonathan McKeown wrote:

On Thursday 07 June 2012 17:00:04 Alexander V. Chernikov wrote:

Hello list!

Since the early days ifconfig(8) has the following functionality:


[hostname in place of literal address]


Moreover, ifconfig em0 some_valid_fqdn/MASK silently ignores it, so you
can't set valid CIDR address using this notation.


I'm not sure that's true. Have you tried it? Because it seems to work here.
Strangely enough, it works on another machine. Ok, this one works and 
can unfortunately be used by other people.


However, the original question remains.



Jonathan





ifconfig accepting hostname as ipv4 address

2012-06-07 Thread Alexander V. Chernikov

Hello list!

Since the early days ifconfig(8) has the following functionality:

..
 address
         For the DARPA-Internet family, the address is either a host name
         present in the host name data base, hosts(5), or a DARPA Internet
         address expressed in the Internet standard “dot notation”.

E.g. one can write `ifconfig em0 some_possibly_unqualified_fqdn` and get
an inet address assigned to the card with a classful mask.

Now this can lead to "fun" things if you have mistyped some keyword
and this keyword exists in the local DNS zone (or a wildcard is configured).


My favorite one (we have a wildcard configured in one of our search
zones):

18:45 [0] dhcp170-36-red# ifconfig vlan123 desroy
18:45 [0] dhcp170-36-red# echo $?
0
18:45 [0] dhcp170-36-red# ifconfig vlan123
vlan123: flags=8003 metric 0 mtu 1500
ether 00:00:00:00:00:00
inet 213.180.204.242 netmask 0xffffff00 broadcast 213.180.204.255
inet6 fe80::222:4dff:fe50:cd2f%vlan123 prefixlen 64 scopeid 0xd
nd6 options=21
vlan: 0 parent interface: 

This is also one of the reasons why ifconfig sometimes "hangs" on 
invalid input.
Moreover, ifconfig em0 some_valid_fqdn/MASK silently ignores it, so you 
can't set valid CIDR address using this notation.


The classful era ended more than 10 years ago; do we still want to keep
this behavior?
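
For comparison, a strict parser that accepts only literal dotted-quad addresses and never consults the resolver could look like the sketch below. This is not how ifconfig(8) currently behaves; is_literal_ipv4() is a hypothetical helper built on the standard inet_pton(3) call:

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Returns 1 and fills 'out' if 's' is a literal dotted-quad IPv4 address,
 * 0 otherwise. No host name lookup is ever performed, so a mistyped
 * keyword cannot silently resolve to an address (or hang on DNS).
 */
static int
is_literal_ipv4(const char *s, struct in_addr *out)
{
	return (inet_pton(AF_INET, s, out) == 1);
}
```

With such a check, `ifconfig vlan123 desroy` would be rejected as an invalid address instead of being resolved through the wildcard record.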



--
WBR, Alexander


Re: FIB separation

2011-09-07 Thread Alexander V. Chernikov

On 07.09.2011 11:17, Julian Elischer wrote:

On 7/16/11 5:43 AM, Vlad Galu wrote:

Hello,


Hello!

A couple of years ago, Stef Walter proposed a patch[1] that enforced
the scope of routing messages. The general consensus was that the best
approach would be the OpenBSD way - transporting the FIB number in the
message and letting the user applications filter out unwanted messages.

Are there any plans to tackle this before 9.0?


I haven't really been following this unfortunately but I see at least
part got done. (ifconfig)
Yes, it is committed as r223735 and r223741. Unfortunately this is not 
(directly) related to routing socket. kern/134931 still remains as it is.


is there anything we need to do before 9.0 that is small but would make
a big difference?
(i.e. fixes, tweaks)

rtsock is a great candidate :)


Julian

One thing that I haven't done and I only recently remembered, was the
ability to have a socket inherit
its fib from the incoming connection SYN instead of from the socket
opening process.
It is a very good idea to have such a possibility, but it has to be
controlled at least by some sort of sysctl or even a per-socket ioctl
(turned off by default)

(at least I am pretty sure I never got that done. (must go check)).



Thanks,
Vlad

[1]
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134931









Re: FIB separation

2011-07-16 Thread Alexander V. Chernikov

Hiroki Sato wrote:
> Vlad Galu  wrote
>   in :
> 
> du> Hello,
> du>
> du> A couple of years ago, Stef Walter proposed a patch[1] that enforced
> du> the scope of routing messages. The general consensus was that the best
> du> approach would be the OpenBSD way - transporting the FIB number in the
> du> message and letting the user applications filter out unwanted
> du> messages.
> du>
> du> Are there any plans to tackle this before 9.0?
> 
>  I am looking into this and investigating other possible extensions in
>  rtsock messages such as addition of a fib member to rt_msghdr.  I am
>  not sure it can be done before 9.0, though...
Actually there was an off-list discussion with bz@ and julian@ about
interface fibs and rtsock changes several weeks ago.

Initial messages:
http://lists.freebsd.org/pipermail/freebsd-net/2011-June/029040.html

I've got 3 different patches:
1) a straightforward kern/134931 fix (no fib in rtsock, no ABI breakage,
sent to bz@)
2) adding fib in rtsock with rtsock versioning and other ABI-keeping tricks
3) adding a special RTA which can contain TLV pairs, with single defined
TLV with routing socket

As a result of the discussion, the first patch was sent to bz@. Since the
patches from kern/134931 are outdated, I am attaching it here.

It is very much like the original patch from kern/134931. The only
difference is that it uses the PACKET_TAG_RTSOCKFAM mbuf_tag more heavily.
This is required to keep raw_input() with the same number of parameters.
Actually it looks rather hackish now.
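
The filtering rule behind the first approach can be sketched in isolation as follows. The struct rt_dispatch_ctx layout and RT_ALLFIBS mirror the attached patch, but struct rt_listener and rt_should_deliver() are hypothetical names, and treating family 0 as "any family" is an assumption of this sketch, not something the patch text confirms:

```c
#define RT_ALLFIBS (-1)	/* message is relevant to every FIB */

/* Per-message dispatch context, as declared in the patch. */
struct rt_dispatch_ctx {
	unsigned short family;	/* Socket family */
	int fibnum;		/* FIB for message or -1 for all */
};

/* Hypothetical per-listener state: what a routing socket is bound to. */
struct rt_listener {
	unsigned short family;	/* 0 is assumed to mean "any family" */
	int fibnum;		/* FIB of the process that opened the socket */
};

/* Returns 1 if the message should be queued to this listener. */
static int
rt_should_deliver(const struct rt_dispatch_ctx *ctx,
    const struct rt_listener *l)
{
	/* Family filter: both sides specific and different means skip. */
	if (ctx->family != 0 && l->family != 0 && ctx->family != l->family)
		return (0);
	/* FIB filter: broadcast messages go to everyone. */
	if (ctx->fibnum == RT_ALLFIBS)
		return (1);
	return (ctx->fibnum == l->fibnum);
}
```

Since the kernel drops messages for other FIBs before they reach the socket, the rtsock message format (and thus the ABI) does not need to change.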


> 
> -- Hiroki

Index: netinet/in.c
===
--- netinet/in.c(revision 223741)
+++ netinet/in.c(working copy)
@@ -1009,7 +1009,7 @@ static void in_addralias_rtmsg(int cmd, struct in_
(struct sockaddr *)&target->ia_addr;
rt_newaddrmsg(cmd, 
  (struct ifaddr *)target,
- 0, &msg_rt);
+ 0, &msg_rt, RT_ALLFIBS);
RTFREE(pfx_ro.ro_rt);
}
return;
Index: net/route.c
===
--- net/route.c (revision 223741)
+++ net/route.c (working copy)
@@ -384,7 +384,7 @@ miss:
 */
bzero(&info, sizeof(info));
info.rti_info[RTAX_DST] = dst;
-   rt_missmsg(msgtype, &info, 0, err);
+   rt_missmsg(msgtype, &info, 0, err, fibnum);
}   
 done:
if (newrt)
@@ -609,7 +609,7 @@ out:
info.rti_info[RTAX_GATEWAY] = gateway;
info.rti_info[RTAX_NETMASK] = netmask;
info.rti_info[RTAX_AUTHOR] = src;
-   rt_missmsg(RTM_REDIRECT, &info, flags, error);
+   rt_missmsg(RTM_REDIRECT, &info, flags, error, fibnum);
if (ifa != NULL)
ifa_free(ifa);
 }
@@ -1522,7 +1522,7 @@ rtinit1(struct ifaddr *ifa, int cmd, int flags, in
}
RT_ADDREF(rt);
RT_UNLOCK(rt);
-   rt_newaddrmsg(cmd, ifa, error, rt);
+   rt_newaddrmsg(cmd, ifa, error, rt, fibnum);
RT_LOCK(rt);
RT_REMREF(rt);
if (cmd == RTM_DELETE) {
Index: net/route.h
===
--- net/route.h (revision 223741)
+++ net/route.h (working copy)
@@ -303,6 +303,11 @@ struct rt_addrinfo {
struct  ifnet *rti_ifp;
 };
 
+struct rt_dispatch_ctx {
+   unsigned short family;  /* Socket family */
+   int fibnum;  /* FIB for message or -1 for all */
+}; 
+
 /*
  * This macro returns the size of a struct sockaddr when passed
  * through a routing socket. Basically we round up sa_len to
@@ -317,6 +322,8 @@ struct rt_addrinfo {
 
 #ifdef _KERNEL
 
+#define RT_ALLFIBS -1
+
 #define RT_LINK_IS_UP(ifp) (!((ifp)->if_capabilities & IFCAP_LINKSTATE) \
 || (ifp)->if_link_state == LINK_STATE_UP)
 
@@ -364,8 +371,8 @@ struct ifmultiaddr;
 void    rt_ieee80211msg(struct ifnet *, int, void *, size_t);
 void    rt_ifannouncemsg(struct ifnet *, int);
 void    rt_ifmsg(struct ifnet *);
-void    rt_missmsg(int, struct rt_addrinfo *, int, int);
-void    rt_newaddrmsg(int, struct ifaddr *, int, struct rtentry *);
+void    rt_missmsg(int, struct rt_addrinfo *, int, int, int);
+void    rt_newaddrmsg(int, struct ifaddr *, int, struct rtentry *, int);
 void    rt_newmaddrmsg(int, struct ifmultiaddr *);
 int     rt_setgate(struct rtentry *, struct sockaddr *, struct sockaddr *);
 void    rt_maskedcopy(struct sockaddr *, struct sockaddr

Re: [PATCH] Remove dead code in netstat from route.c

2011-07-12 Thread Alexander V. Chernikov

On 12.07.2011 11:10, Garrett Cooper wrote:

On Mon, Jul 11, 2011 at 11:16 PM, Alexander V. Chernikov
  wrote:

Garrett Cooper wrote:

Hi,
 While trying to determine how to print out routes via kvm for
net-snmp, I noticed that there's a chunk of code from the 4.4 BSD Lite
days that isn't executed in netstat as NewTree is always 0. The
following patch removes that dead code and gets the FreeBSD source for
netstat more in line with NetBSD and OpenBSD's copy.
Thanks!


Hello!

This code is still working (I tested it several months ago). Using the
NET_RT_DUMP sysctl gives us less information than KVM, but it is a much
better way of requesting information:
the KVM path heavily assumes a RADIX tree is used and simply implements
walking the tree in userland (p_tree()), which is quite hackish. Since some
dynamic routing software can change massive amounts of data at once (a BGP
session with full-view going up/down), or a physical interface with several
hundred vlans can go up/down, requesting routing data via KVM can lead to
fully unexpected behaviour. Calling this on a regular basis from net-snmp is
not the best thing one can do.

Additionally, there can be address families where RADIX is unnecessarily
complicated since only a direct key match is required (MPLS, for example).
Moving away from the RADIX implementation for such a family will require a
lot of 'if (af == AF_MPLS)' code in many base userland utilities, since the
assumption that RADIX is used exists in many places, unfortunately.

Requesting routes via KVM is a completely undocumented way that depends on
kernel internals. On the other hand, the NET_RT_DUMP sysctl is
documented in sysctl(3) and is used by all major routing software
(quagga, bird, openbgp). The KVM way also makes us more RADIX-dependent,
which should be avoided.


 That's a compelling argument, but why is NewTree hardwired to 0
then (apart from the fact that kvm works with non-live kernel images)?

I don't know - it was too long ago :)
Walking RADIX directly gives the advantage of accessing its internal
fields like the "refcount" and "use" values which we are used to seeing in
'netstat -rn' output, for example.


Thanks,
-Garrett





Re: [PATCH] Remove dead code in netstat from route.c

2011-07-11 Thread Alexander V. Chernikov
Garrett Cooper wrote:
> Hi,
> While trying to determine how to print out routes via kvm for
> net-snmp, I noticed that there's a chunk of code from the 4.4 BSD Lite
> days that isn't executed in netstat as NewTree is always 0. The
> following patch removes that dead code and gets the FreeBSD source for
> netstat more in line with NetBSD and OpenBSD's copy.
> Thanks!

Hello!

This code is still working (I tested it several months ago). Using the
NET_RT_DUMP sysctl gives us less information than KVM, but it is a much
better way of requesting information:
the KVM path heavily assumes a RADIX tree is used and simply implements
walking the tree in userland (p_tree()), which is quite hackish. Since some
dynamic routing software can change massive amounts of data at once (a BGP
session with full-view going up/down), or a physical interface with several
hundred vlans can go up/down, requesting routing data via KVM can lead to
fully unexpected behaviour. Calling this on a regular basis from net-snmp is
not the best thing one can do.

Additionally, there can be address families where RADIX is unnecessarily
complicated since only a direct key match is required (MPLS, for example).
Moving away from the RADIX implementation for such a family will require a
lot of 'if (af == AF_MPLS)' code in many base userland utilities, since the
assumption that RADIX is used exists in many places, unfortunately.

Requesting routes via KVM is a completely undocumented way that depends on
kernel internals. On the other hand, the NET_RT_DUMP sysctl is
documented in sysctl(3) and is used by all major routing software
(quagga, bird, openbgp). The KVM way also makes us more RADIX-dependent,
which should be avoided.



> -Garrett
> 
> 
> 
> 



Re: Is Elf format documented anywhere?

2008-01-16 Thread Alexander V. Chernikov

Yuri wrote:

When I am trying to understand how Elf executable works I am only getting to few
pages with very fragmentary information.

Googling many constants like R_386_PC32, R_386_TLS_LD only yields some
discussion references and code.

Anybody knows where to read more about the Elf format? Does such document even
exist?

Yuri

You can look at http://www.skyfree.org/linux/references/ELF_Format.pdf

