Re: epair failure in production on 11.1-STABLE (r328930) - weird!

2018-07-02 Thread k simon
When the host boots, a loader hint sets the epair queue max length to 86016.
I have not seen any related problem since I set it; it has been stable.
But epair consumes a lot of CPU: it easily hits 100% CPU usage when
high-rate packet streams are passed through it.
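
For anyone wanting to try the same setting, a minimal sketch of how it can
be set at boot (assuming net.link.epair.netisr_maxqlen is honoured as a
loader tunable on this branch; 86016 is just the value I use):

    # /boot/loader.conf -- sketch only; confirm after reboot with
    #   sysctl net.link.epair.netisr_maxqlen
    net.link.epair.netisr_maxqlen="86016"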

Simon
20180703

On 2018/7/3 06:16, Bjoern A. Zeeb wrote:
> On 2 Jul 2018, at 21:11, Dr Josef Karthauser wrote:
> 
>> We’re experiencing a strange production failure with epair (which 
>> we’re using to talk to vimage jails).
>>
>> FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 
>> 16:05:59 GMT 2018 root@s5:/usr/obj/usr/src/sys/TRUESPEED  amd64
>>
>> Looks like epair has suddenly stopped forwarding packets between the 
>> pair interfaces. Our server has been up for 82 days and it’s been 
>> working fine, but suddenly packets have stopped being forwarded 
>> between epairs across the entire system. (We’ve got around 30 epairs 
>> on the host).  So, we’ve got a sudden ARP resolution failure which is 
>> affecting all services. :(.
> 
> Ok, that’s a very interesting new observation I have not heard before or 
> missed.   You are saying that for about 30 epair pairs NONE is working 
> anymore?   All 30 are “dead”?
> 
> /bz


Re: epair failure in production on 11.1-STABLE (r328930) - weird!

2018-07-02 Thread Bjoern A. Zeeb

On 2 Jul 2018, at 21:11, Dr Josef Karthauser wrote:

We’re experiencing a strange production failure with epair (which 
we’re using to talk to vimage jails).


FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 
16:05:59 GMT 2018 root@s5:/usr/obj/usr/src/sys/TRUESPEED  amd64


Looks like epair has suddenly stopped forwarding packets between the 
pair interfaces. Our server has been up for 82 days and it’s been 
working fine, but suddenly packets have stopped being forwarded 
between epairs across the entire system. (We’ve got around 30 epairs 
on the host).  So, we’ve got a sudden ARP resolution failure which 
is affecting all services. :(.


Ok, that’s a very interesting new observation I have not heard before 
or missed.   You are saying that for about 30 epair pairs NONE is 
working anymore?   All 30 are “dead”?


/bz


Re: epair failure in production on 11.1-STABLE (r328930) - weird!

2018-07-02 Thread Kristof Provost

On 2 Jul 2018, at 23:11, Dr Josef Karthauser wrote:
Break break. We’ve just seen a Bugzilla bug report, 22710, reporting 
that epair fails when the queue limit is hit 
(net.link.epair.netisr_maxqlen). We’ve just introduced a high-bandwidth 
service on this machine, so that’s probably what caused the issue.



I think you meant 227100 there.


But, why has hitting the queue limit broken it entirely!

It’s a bug in the epair code. Something’s wrong when it handles a 
queue overflow and it never leaves the overflow state, dropping all new 
packets instead.
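
One way to watch for that from userland (a sketch of mine, not anything
taken from the epair code itself) is to sample the epair workstream
counters in netstat -Q; if the interface really is stuck in the overflow
state, you would expect QDrops to keep climbing while Handled stays flat
even though traffic is still being offered:

    # sample the netisr counters for epair every couple of seconds
    while true; do
        date '+%H:%M:%S'
        netstat -Q | grep -E 'WMark|epair'
        sleep 2
    done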
I’m afraid that I’ve not been able to do anything about it yet. 
Bjoern is more familiar with that code, I believe, and might be able to 
help.


Regards,
Kristof


epair failure in production on 11.1-STABLE (r328930) - weird!

2018-07-02 Thread Dr Josef Karthauser
We’re experiencing a strange production failure with epair (which we’re using 
to talk to vimage jails).

FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 16:05:59 GMT 
2018 root@s5:/usr/obj/usr/src/sys/TRUESPEED  amd64

Looks like epair has suddenly stopped forwarding packets between the pair 
interfaces. Our server has been up for 82 days and it’s been working fine, but 
suddenly packets have stopped being forwarded between epairs across the entire 
system. (We’ve got around 30 epairs on the host).  So, we’ve got a sudden ARP 
resolution failure which is affecting all services. :(.
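
For context, each jail is wired up with the usual epair/vnet pattern, 
roughly like this (a sketch with illustrative names, not our actual config):

    # /etc/jail.conf (names are placeholders)
    web {
        vnet = "new";                      # jail gets its own network stack
        vnet.interface = "epair7b";        # the b side is moved into the jail
        exec.prestart  = "ifconfig epair7 create";
        exec.prestart += "ifconfig epair7a up";
        exec.poststop  = "ifconfig epair7a destroy";
        path = "/jails/web";
        exec.start = "/bin/sh /etc/rc";
        exec.stop  = "/bin/sh /etc/rc.shutdown";
    }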

Here’s the test. On a working machine this works fine:

# Create an epair and put an IP address on it, so we can generate ARP 
traffic with PING. 
root@magnesium:/usr/home/systems # ifconfig epair create
epair7a
root@magnesium:/usr/home/systems # ifconfig epair7a up
root@magnesium:/usr/home/systems # ifconfig epair7b up
root@magnesium:/usr/home/systems # ifconfig epair7a inet 10.140.0.1/30

# Generate ARP traffic over the epair… should see arp requests on 
epair7b.
root@magnesium:/usr/home/systems # ping 10.140.0.2
PING 10.140.0.2 (10.140.0.2): 56 data bytes

# Watch traffic coming in from the epair
root@magnesium:/usr/home/systems # tcpdump -i epair7b
10:22:27.446651 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 
28
10:22:28.475086 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 
28
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel

Works fine.

However, on the failing machine we don’t get any packets forwarded any more 
(remember, it’s been working fine for a few months and then suddenly stopped 
working :( ).

root@s5:/usr/home/systems # ifconfig epair create
epair19a
root@s5:/usr/home/systems # ifconfig epair19a up
root@s5:/usr/home/systems # ifconfig epair19b up
root@s5:/usr/home/systems # ifconfig epair19a inet 10.130.0.1/30

root@s5:/usr/home/systems # ping 10.130.0.2
PING 10.130.0.2 (10.130.0.2): 56 data bytes

root@s5:/usr/home/systems # tcpdump -ni epair19a
09:24:20.396384 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 
28
09:24:21.404737 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 
28
^C 

root@s5:/usr/home/systems # tcpdump -ni epair19b
[Tumbleweed - no traffic seen]
^C

Has anyone seen this before? We’re going to reboot and see if that fixes the 
problem.

The failing kernel in question is:

FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 16:05:59 GMT 
2018 root@s5:/usr/obj/usr/src/sys/TRUESPEED  amd64


Break break. We’ve just seen a Bugzilla bug report, 22710, reporting that epair 
fails when the queue limit is hit (net.link.epair.netisr_maxqlen). We’ve just 
introduced a high-bandwidth service on this machine, so that’s probably what 
caused the issue.

We’ve currently got a value of:

net.link.epair.netisr_maxqlen: 2100

root@s5:/usr/home/systems # netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         1            1
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs         disabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
epair      8   2100    cpu  default   CD-

Workstreams:
 WSID CPU   Name     Len WMark     Disp'd  HDisp'd  QDrops     Queued    Handled
    0   0   ip         0   253  385468689        0       0   49360754  434829441
    0   0   igmp       0     0          0        0       0          0          0
    0   0   rtsock     0     5          0        0       0       1144       1144
    0   0   arp        0     0    5573045        0       0          0    5573045
    0   0   ether      0     0 1125223166        0       0          0 1125223166
    0   0   ip6        0     4         90        0       0    1220274    1220364
    0   0   epair      0  2100          0        0     214 4994675481 4994675481

But we can’t see how much of the queue is currently being used, or what size we 
need to set it to.
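
Our best reading of the columns above (unverified): Len looks like the 
instantaneous queue length, WMark the high-water mark since boot and QDrops 
the packets dropped at the limit, so the epair WMark sitting at the QLimit 
(2100) plus a non-zero QDrops suggests the queue has filled at least once. 
Something like this might be how to check it and, if the sysctl turns out to 
be writable at run time, raise it:

    # current setting and epair counters
    sysctl net.link.epair.netisr_maxqlen
    netstat -Q | grep -E 'WMark|epair'

    # try raising the limit (86016 is only an example value; this assumes
    # the sysctl is writable at run time -- if not, it would presumably have
    # to go into /boot/loader.conf and need a reboot)
    sysctl net.link.epair.netisr_maxqlen=86016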

But, why has hitting the queue limit broken it entirely! 

Help!

Cheers,
Joe
-- 
Dr Josef Karthauser
Chief Technical Officer
(01225) 300371 / (07703) 596893
www.truespeed.com / theTRUESPEED / @theTRUESPEED

Re: Regarding latency of Netmap i/o

2018-07-02 Thread Luigi Rizzo
On Mon, Jul 2, 2018 at 1:34 AM, Suneet Singh  wrote:
> Dear Professor,
>
> I am Suneet from India. I am a PhD student at Unicamp, Campinas, Brazil. I
> am testing the latency of the MACSAD P4 software switch. I found that the
> latency using netmap i/o is higher than using socket_mmap i/o, while in the
> case of DPDK the latency is very low.

Please quantify the latency: describe the type of experiment you run, the
system's configuration and the latency figures that you are seeing, because
'higher' and 'very low' mean nothing without context.

Batching is almost never a reason for significantly higher latency
unless it is misused.

Typically there are mistakes in setting up the experiment that
cause higher latency than expected.
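
If you want a baseline that is independent of your switch, one option
(a sketch: it assumes netmap's pkt-gen is built on both hosts and the NICs
are connected back-to-back; check pkt-gen(8) in your tree for the exact
options) is pkt-gen's ping/pong mode, which reports round-trip times
directly:

    # host B: reflect packets back to the sender (ix1 is a placeholder name)
    pkt-gen -i ix1 -f pong

    # host A: send requests and print round-trip statistics
    pkt-gen -i ix0 -f ping -n 100000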

cheers
luigi

>
> As I read the research papers related to netmap, I found that batching may
> be the reason for the higher latency of netmap compared to socket_mmap.
>
> However, I am not sure why the latency of my switch is higher using netmap
> compared to socket_mmap. Could you please let me know how netmap performs
> in terms of end-to-end latency compared to socket_mmap?
>
> Thank you so much for your precious time.
>
> " Man needs his difficulties because they are necessary to enjoy success."
>
> Kind Regards
> Suneet Kumar Singh
>
> https://www.researchgate.net/profile/Suneet_Singh7



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2217533   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---