[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Morten Brørup
Bruce,

Please reconsider your interpretation of the word "debuggability". Debugging is 
not only something that R&D staff does in a lab. Debuggability can also be 
interpreted as a network engineer's ability to debug what is happening in a 
production network.

Referring to the link you kindly provided (to the discussion on the OVS mailing 
list), in my eyes the context of the itemized requirements is a production 
environment, not a development environment. Daniele Di Proietto wrote:

>I think we can agree that there are a few rough spots that prevent it from 
>being easily deployed and used.

>I was hoping to get some feedback from the community about those rough spots, 
>i.e. areas where OVS+DPDK can/needs to improve to become more "production 
>ready" and user-friendly.


Med venlig hilsen / kind regards
- Morten Brørup

-----Original Message-----
From: Bruce Richardson [mailto:bruce.richard...@intel.com] 
Sent: 21. december 2015 16:40
To: Matthew Hall
Cc: Morten Brørup; Kyle Larose; dev at dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 01:15:57PM -0500, Matthew Hall wrote:
> On Wed, Dec 16, 2015 at 11:56:11AM +, Bruce Richardson wrote:
> > Having this work with any application is one of our primary targets here. 
> > The app author should not have to worry too much about getting basic 
> > debug support. Even if it doesn't work at 40G small packet rates, 
> > you can get a lot of benefit from a scheme that provides functional 
> > debugging for an app.
> 
> I think my issue is that I don't think I buy into this particular set 
> of assumptions above.
> 
> I don't think a capture mechanism that doesn't work right in the real 
> use cases of the apps actually buys us much. If all we care about is 
> quickly dumping some frames to a pcap for occasional debugging, I 
> already have some C code for that I can donate which is a lot less 
> complicated than the trouble being proposed for "basic debug support". 
> Or we could use libpcap's equivalent... but it's quite a lot more complicated 
> than the code I have.
> 
> If we're going to assign engineers to this it's costing somebody a lot 
> of time and money. So I'd prefer to get them focused on something that 
> will always work even with high loads, such as real bpfjit support.
> 
> Matthew.

Hi,

I think it basically boils down to the fact that we are trying to solve different 
problems. Our current focus is the generic usability of all DPDK applications, 
as discussed at the DPDK Userspace Summit. Our plan is to provide some way to 
allow standard packet capture apps, such as tcpdump, to be used easily with 
DPDK. This is something also being looked for by folks such as those working on 
OVS e.g. called out at 
http://openvswitch.org/pipermail/dev/2015-August/058814.html

  "- Insight into the system and debuggability: nothing beats tcpdump for the
kernel datapath.  Can something similar be done for the userspace
datapath?

  - Consistency of the tools: some commands are slightly different for the
userspace/kernel datapath.  Ideally there shouldn't be any difference."

Providing libraries for packet capture at high packet rates is a related, but 
different problem, that we'll maybe look to investigate in the future - 
assuming that nobody else solves it first.

/Bruce



[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Gray, Mark D
> Bruce,
> 
> Please reconsider your interpretation of the word "debuggability".
> Debugging is not only something that R&D staff does in a lab. Debuggability
> can also be interpreted as a network engineer's ability to debug what is
> happening in a production network.

Is tcpdump used in large production cloud environments? I would have 
thought other less intrusive (and less manual) tools would be used? Isn't
that one of the benefits of SDN?



[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Gray, Mark D
> This is something also being looked for by folks such as those
> working on OVS e.g. called out at
> http://openvswitch.org/pipermail/dev/2015-August/058814.html
> 
>   "- Insight into the system and debuggability: nothing beats tcpdump for the
> kernel datapath.  Can something similar be done for the userspace
> datapath?
> 
>   - Consistency of the tools: some commands are slightly different for the
> userspace/kernel datapath.  Ideally there shouldn't be any difference."
> 

I had a painful experience with OVS-DPDK recently which may be representative
of a typical usability issue encountered. 

I was trying to connect two Openstack compute nodes together.  I had done
the configuration without DPDK first. It was easy to debug as I could use
tcpdump to look at the eth ports and see what type of traffic
was entering the compute node. I also needed to check if the traffic
was actually VxLAN traffic and what the VNI was in order to be able to
follow the traffic around the bridges in OVS. This all went quite well and
I was able to bring up my set up quite easily. 

Then I tried to set up the same thing with DPDK. I couldn't get traffic between
the compute nodes but I had no easy way to just dump the traffic coming into
(or out of) the compute node. Of course, there were some things I could do but,
for me, DPDK would be far more usable if I could just use tcpdump. As I know
DPDK to some extent, I can usually get around these problems but I suspect
that a new user to DPDK  would get very discouraged and frustrated by an 
experience like that. 

I'm not sure how often tcpdump is used in production environments but it is
very useful when debugging a live system without having to modify code. It
would be good if it could work at high rates and be really flexible but it
probably makes sense to focus on the basics first.


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Bruce Richardson
On Wed, Dec 16, 2015 at 01:15:57PM -0500, Matthew Hall wrote:
> On Wed, Dec 16, 2015 at 11:56:11AM +, Bruce Richardson wrote:
> > Having this work with any application is one of our primary targets here. 
> > The app author should not have to worry too much about getting basic debug 
> > support. Even if it doesn't work at 40G small packet rates, you can get a 
> > lot of benefit from a scheme that provides functional debugging for an app. 
> 
> I think my issue is that I don't think I buy into this particular set of 
> assumptions above.
> 
> I don't think a capture mechanism that doesn't work right in the real use 
> cases of the apps actually buys us much. If all we care about is quickly 
> dumping some frames to a pcap for occasional debugging, I already have some C 
> code for that I can donate which is a lot less complicated than the trouble 
> being proposed for "basic debug support". Or we could use libpcap's 
> equivalent... but it's quite a lot more complicated than the code I have.
> 
> If we're going to assign engineers to this it's costing somebody a lot of 
> time and money. So I'd prefer to get them focused on something that will 
> always work even with high loads, such as real bpfjit support.
> 
> Matthew.

Hi,

I think it basically boils down to the fact that we are trying to solve different
problems. Our current focus is the generic usability of all DPDK applications,
as discussed at the DPDK Userspace Summit. Our plan is to provide some way to
allow standard packet capture apps, such as tcpdump, to be used easily with
DPDK. This is something also being looked for by folks such as those working
on OVS e.g. called out at 
http://openvswitch.org/pipermail/dev/2015-August/058814.html

  "- Insight into the system and debuggability: nothing beats tcpdump for the
kernel datapath.  Can something similar be done for the userspace
datapath?

  - Consistency of the tools: some commands are slightly different for the
userspace/kernel datapath.  Ideally there shouldn't be any difference."

Providing libraries for packet capture at high packet rates is a related, but
different problem, that we'll maybe look to investigate in the future - assuming
that nobody else solves it first.

/Bruce


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Matthew Hall
On Mon, Dec 21, 2015 at 04:17:26PM +, Gray, Mark D wrote:
> Is tcpdump used in large production cloud environments? I would have 
> thought other less intrusive (and less manual) tools would be used? Isn't
> that one of the benefits of SDN?

tcpdump, tshark, wireshark, libpcap, etc. have been used every single place I 
ever worked, including in production under heavy load.

This is because nobody wants to redo the library of many tens of thousands of 
hours of protocol dissectors.

This is also why I am trying to point out what is required to get a solution 
that I am confident will really work when people are counting on it, which I 
am concerned the current proposals do not cover.

Matthew.


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Morten Brørup
Bruce,

Matthew presented a very important point a few hours ago: We don't need tcpdump 
support for debugging the application in a lab; we already have plenty of other 
tools for debugging what we are developing. We need tcpdump support for 
debugging network issues in a production network.

In my "hardened network appliance" world, a solution designed purely for legacy 
applications (tcpdump, Wireshark etc.) is useless because the network 
technician doesn't have access to these applications on the appliance.

While a PC system running a DPDK based application might have plenty of spare 
lcores for filtering, the SmartShare appliances are already using all lcores 
for dedicated purposes, so the runtime filtering has to be done by the IO 
lcores (otherwise we would have to rehash everything and reallocate some lcores 
for mirroring, which I strongly oppose). Our non-DPDK firmware has also always 
been filtering directly in the fast path.

If the filter is so complex that it unexpectedly degrades the normal traffic 
forwarding performance, the mirror still reflects all the forwarded network 
traffic, not just some of it. In many real life network debugging scenarios 
this is better than the alternative: keeping the traffic forwarding up at full 
performance and having a network technician trying to understand a mirror 
output where some of the relevant packets are unexpectedly missing.

Although it is generally considered bad design if a system's behavior (or 
performance) changes unexpectedly when debugging features are being used, 
experienced network technicians have already grown accustomed to the 
performance of most non-trivial network equipment depending on the number of 
features enabled and how it is configured, so reality might beat theory here. 
(Still, other companies might prefer to keep their fast path performance 
unaffected and dedicate/reallocate some lcores for filtering.)

I am probably repeating myself here, but I would prefer if the DPDK provided 
the packet capturing framework in the form of a set of efficient libraries for 
1. BPF filtering (e.g. a simple BPF interpreter or a DPDK variant of bpfjit), 
2. scalable packet queueing for the mirrored packets (probably multi producer, 
single or multi consumer), as well as 3. high resolution time stamping 
(preferably easily convertible to the pcap file packet timestamp format). Then 
the DPDK application can take care of interfacing to the attached application 
and outputting the mirrored packets to the appropriate destination, e.g. a pcap 
file, a Wireshark extcap named pipe, a dedicated RSPAN VLAN, or an ERSPAN 
tunnel. And an example application should show how to bind all this together in 
a tcpdump-like scenario for debugging a production network.
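
To make item 1 above a little more concrete, here is a minimal sketch of a "simple BPF interpreter". It is written in Python purely for brevity (a DPDK library would be C working on mbufs), it handles only the few classic BPF opcodes needed by the filter expression "ip", and the name `bpf_run` is made up for this illustration. The opcode values and the four-instruction program are the standard classic BPF encoding, as printed by "tcpdump -ddd ip".

```python
import struct

# Classic BPF opcode values (standard instruction encoding).
LD_W_ABS = 0x20   # A = 32-bit word at pkt[k]
LD_H_ABS = 0x28   # A = 16-bit half-word at pkt[k]
LD_B_ABS = 0x30   # A = byte at pkt[k]
JEQ_K = 0x15      # skip jt instructions if A == k, else skip jf
RET_K = 0x06      # return k (bytes to keep; 0 means drop)

_LOADS = {LD_W_ABS: (4, ">I"), LD_H_ABS: (2, ">H"), LD_B_ABS: (1, ">B")}


def bpf_run(prog, pkt):
    """Run a classic BPF program, given as a list of (code, jt, jf, k)
    tuples, against the packet bytes. Returns the accepted snap length,
    or 0 if the packet is rejected. Hypothetical helper, subset only."""
    a = 0
    pc = 0
    while pc < len(prog):
        code, jt, jf, k = prog[pc]
        pc += 1
        if code in _LOADS:
            size, fmt = _LOADS[code]
            if k + size > len(pkt):
                return 0  # an out-of-bounds load rejects the packet
            a = struct.unpack_from(fmt, pkt, k)[0]
        elif code == JEQ_K:
            pc += jt if a == k else jf
        elif code == RET_K:
            return k
        else:
            raise ValueError("opcode 0x%02x not handled in this sketch" % code)
    return 0


# Program for the filter expression "ip": load the EtherType at offset 12,
# accept if it equals 0x0800, otherwise reject.
IP_FILTER = [
    (LD_H_ABS, 0, 0, 12),
    (JEQ_K, 0, 1, 0x0800),
    (RET_K, 0, 0, 262144),
    (RET_K, 0, 0, 0),
]
```

An IO lcore could call something like `bpf_run(IP_FILTER, frame)` per packet and mirror only frames with a non-zero result; a JIT variant would compile the same instruction list to native code instead of interpreting it.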

A note about timestamps: In theory, the captured packets should be time stamped 
as early as possible. In practice though, it is probably sufficiently accurate 
to time stamp the accepted packets after filtering, especially if they are 
filtered by an IO lcore. Alternatively, they can be time stamped when consumed 
from the mirror output queue.
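
As a rough sketch of what "easily convertible to the pcap file packet timestamp format" could mean: the classic pcap record header stores seconds and microseconds as two 32-bit fields, so a free-running cycle counter only needs a frequency and a wall-clock reference. The helper names, the `hz` parameter and the `epoch_at_zero` reference below are assumptions made for this example; the header layouts are those of the classic pcap file format.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4    # classic pcap, microsecond timestamps
LINKTYPE_ETHERNET = 1


def pcap_global_header(snaplen=65535):
    # magic, version 2.4, thiszone, sigfigs, snaplen, link type;
    # written little-endian here, readers detect byte order from the magic
    return struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0,
                       snaplen, LINKTYPE_ETHERNET)


def cycles_to_pcap_ts(cycles, hz, epoch_at_zero):
    """Convert a cycle count to (ts_sec, ts_usec).
    hz is the counter frequency; epoch_at_zero is the Unix time at which
    the counter read zero (both are assumptions of this sketch)."""
    sec, rem = divmod(cycles, hz)
    usec = rem * 1_000_000 // hz
    return epoch_at_zero + sec, usec


def pcap_record(pkt, ts_sec, ts_usec, snaplen=65535):
    # per-packet header: ts_sec, ts_usec, captured length, original length
    incl = min(len(pkt), snaplen)
    return struct.pack("<IIII", ts_sec, ts_usec, incl, len(pkt)) + pkt[:incl]
```

With this split, the stamping site (after filtering, or at dequeue from the mirror queue) only has to record the raw cycle count; the conversion can be done later by whoever writes the pcap output.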

A note about packet ordering: Mirrored packets belonging to different flows are 
probably out of order because of RSS, where multiple lcores contribute to the 
mirror output. This packet ordering inaccuracy could also serve as a reason for 
not being too strict about the accuracy of the timestamps on the mirrored 
packets.


Med venlig hilsen / kind regards
- Morten Brørup


-----Original Message-----
From: Bruce Richardson [mailto:bruce.richard...@intel.com] 
Sent: 16. december 2015 14:13
To: Morten Brørup
Cc: Matthew Hall; Kyle Larose; dev at dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 01:26:11PM +0100, Morten Brørup wrote:
> Bruce,
> 
> Please note that tcpdump is a stupid name for a packet capture application 
> that supports much more than just TCP.
> 
> I had missed the point about ethdev supporting virtual interfaces, so thank 
> you for pointing that out. That covers my concerns about capturing packets 
> inside tunnels.
> 
> I will gladly admit that you Intel guys are probably much more competent in 
> the field of DPDK performance and scalability than I am. So Matthew and I 
> have been asking you to kindly ensure that your solution scales well at very 
> high packet rates too, and pointing out that filtering before copying is 
> probably cheaper than copying before filtering. You mention that it leads to 
> an important choice about which lcores get to do the work of filtering the 
> packets, so that might be worth some discussion.
> 
> :-)
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 

Thanks for your support.

We may look at having a certain amount of flexibility in the configuration of 
the setup, so as to avoid limiting the use of the functionality.

For scalability at very high packet rates, it's s

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Arnon Warshavsky
2 points from our experience in saving pcap files from a dpdk 10G fire hose:

1)
Our capture module provides a small "bit-vector" to the code that handles
the packets.
Since our packet processing code is already finding out basic stuff about
the packet traversing it (is it IPv4? v6? is it TCP? is it fragmented?
etc.), it sets the relevant bits ON as it goes, so that the capture module
can later quickly (by masking against the desired filters) decide if a packet
needs to be captured.
Point is - when a capture layer exposes a slim API that lets it utilize
info coming from other modules, it's easier and less expensive to handle
the fire hose.

2)
In many cases we are interested in capturing complete TCP flows, or at
least the first X packets of them.
In this case, a more expensive filter may be applied only on the SYN packet
and, when it matches, turns ON a bit on the TCP flow applicative context that
says we want to capture any packet falling under this tuple.
Point is - applicative filters at different costs are applied on different
packet types utilizing the mask from the previous bullet.

Such a model would obviously need to be optional on a formal capture layer,
but when dealing with a fire hose - I find it very useful.
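
The two points above might be sketched roughly as follows. This is only an illustration in Python, not the actual capture module: the bit names, `mask_match` and the `FlowCapture` class are all invented for the example, and the "expensive filter" is just a callable supplied by the caller.

```python
# Hypothetical classification bits, set by the packet-processing code as it
# parses each packet anyway.
F_IPV4, F_IPV6, F_TCP, F_FRAG, F_SYN = 1, 2, 4, 8, 16


def mask_match(pkt_bits, want_mask):
    """Point 1: cheap test, true when every wanted bit is set."""
    return (pkt_bits & want_mask) == want_mask


class FlowCapture:
    """Point 2: run the expensive filter on SYN packets only and mark the
    flow tuple, so later packets of that flow are captured without it."""

    def __init__(self, want_mask, expensive_filter):
        self.want_mask = want_mask
        self.expensive_filter = expensive_filter
        self.flows = set()      # flow tuples marked for capture
        self.captured = []

    def offer(self, pkt_bits, flow_tuple, pkt):
        if flow_tuple in self.flows:
            # flow already marked: no filter evaluation needed
            self.captured.append(pkt)
            return True
        if (mask_match(pkt_bits, self.want_mask | F_SYN)
                and self.expensive_filter(pkt)):
            self.flows.add(flow_tuple)
            self.captured.append(pkt)
            return True
        return False
```

The cost profile is the interesting part: most packets are decided by one set lookup or one AND/compare, and the expensive filter runs once per candidate flow rather than once per packet.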

/Arnon

-

On Wed, Dec 16, 2015 at 12:45 PM, Bruce Richardson <
bruce.richardson at intel.com> wrote:

> On Mon, Dec 14, 2015 at 05:36:13PM -0500, Matthew Hall wrote:
> > On Mon, Dec 14, 2015 at 04:29:41PM -0500, Kyle Larose wrote:
> > > I've seen lots of ideas and options tossed around which would solve
> > > some or all of the above items, but nobody actually committing to
> > > anything. What can we do to actually agree on a solution to go and
> > > implement? I'm relatively new to the community, so I don't really know
> > > how this stuff works. Do people typically form a working group where
> > > they go off and discuss the problem, and then come back to the main
> > > community with a proposal? Or do people just submit RFCs independently
> > > with their own ideas?
> > >
> > > Thanks,
> > > Kyle
> >
> > I am getting the impression of a misplaced sense of urgency / panic. I don't
> > think anybody came up with a reason why we have to answer all these questions
> > tremendously quickly. It will take some more time, particularly with the
> > holidays, for the developers to finish the last bug fixes on the current
> > release before they have time to discuss 2.3 features.
> >
> > When that happens, someone working on DPDK full time will be identified as the
> > leader for the feature, that will lead the effort on PCAP, and help us
> > formulate the plan. Until then, what we really could use at this point is not
> > necessarily more writings and speculation, but an answer on some key tech
> > questions, particularly from some kernel guys:
> >
> > 1) How do we get the pcap filter string and/or BPF opcode vector from libpcap
> > / tcpdump / tshark / wireshark, into the DPDK application? There we can
> > compile it using the user-space bpfjit, so we can filter the packets at very
> > high speeds and not end up breaking everything doing a ton of stupid copies
> > when somebody does a capture of one flow on his i40e device or such. libpcap
> > is crappy about this, as it sends it all over syscalls which are always
> > assuming the kernel is on the other end, which is a bad assumption on their
> > part but many decades old and not so easy to fix.
> >
> > 2) How do we get the matched packets back out to the extcap or libpcap? From
> > what I saw extcap is tshark / wireshark only, which are 1) GPL licensed in
> > various ways, 2) not as widely used as libpcap. So using only extcap might be
> > kind of crappy.
> >
> > 3) For libpcap to work, maybe it will help if some of our kernel guys can help
> > us find out how to "detect" the kernel put a BPF capture filter onto a TUN /
> > TAP interface, and copy that filter to the DPDK app. Then, take any matched
> > packets and write them back onto the TUN / TAP. This would also be super
> > efficient and work with more off-the-shelf tools besides just tshark /
> > wireshark.
> >
> > If we don't find the answers for these items I don't think we have a path to a
> > working solution, forgetting about all the nice-to-have points such as UX
> > issues, troubleshooting, debugging, etc.
> >
> > Matthew.
>
> Hi,
>
> we are currently doing some investigation and prototyping for this feature.
> Our current thinking is the following:
> * to allow dynamic control of the filtering, we are thinking of making use of
>   the multi-process infrastructure in DPDK. A secondary process can attach to a
>   primary at runtime and provide the packet filtering and dumping capability.
> * ideally we want to create a generic packet mirroring callback inside the EAL,
>   that can be set up to mirror packets going through Rx/Tx on an ethdev.
> * using this, packets being received on the port to be monitored are sent via
>   an rte_ring (ring 

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Morten Brørup
Bruce,

Please note that tcpdump is a stupid name for a packet capture application that 
supports much more than just TCP.

I had missed the point about ethdev supporting virtual interfaces, so thank you 
for pointing that out. That covers my concerns about capturing packets inside 
tunnels.

I will gladly admit that you Intel guys are probably much more competent in the 
field of DPDK performance and scalability than I am. So Matthew and I have been 
asking you to kindly ensure that your solution scales well at very high packet 
rates too, and pointing out that filtering before copying is probably cheaper 
than copying before filtering. You mention that it leads to an important choice 
about which lcores get to do the work of filtering the packets, so that might 
be worth some discussion.

:-)

Med venlig hilsen / kind regards
- Morten Brørup


-----Original Message-----
From: Bruce Richardson [mailto:bruce.richard...@intel.com] 
Sent: 16. december 2015 12:56
To: Morten Brørup
Cc: Matthew Hall; Kyle Larose; dev at dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> Bruce,
> 
> This doesn't really sound like tcpdump to me; it sounds like port mirroring.

It's actually a bit of both, in my opinion: it's designed to allow basic 
mirroring of traffic on a port to allow that traffic to be sent to a tcpdump 
destination.
By going with a more generic approach, we hope to enable more possible use 
cases than just focusing on TCP.


> 
> Your suggestion is limited to physical ports only, and cannot be attached 
> further inside the application, e.g. for mirroring packets related to a 
> specific VLAN.

Yes, the lack of attachment inside the app is a limitation. There are two types 
of scenarios that could be considered for packet capture:
* ones where the application can be modified to do its own filtering and 
capturing.
* ones where you want a generic capture mechanism which can be used on any 
application without modification.
We have chosen to focus more on the second one, as that is where a generic 
solution for DPDK is likely to lie. For the first case, the application writer 
himself knows the type of traffic and how best to capture and filter it, so I 
don't think a generic one-size-fits-all solution is possible. [Though a couple 
of helper libraries may be of use]

As for physical ports, the scheme should work for any ethdev - why do you see 
it only being limited to physical ports? What would you want to see monitored 
that we are missing?

> 
> Furthermore, it doesn't sound like the filtering part scales well. Consider a 
> fully loaded 40 Gbit/s port. You would need to copy all packets into a single 
> rte_ring to the attached filtering process, which would then require its own 
> set of lcores to probably discard most of these packets when filtering. I 
> agree with Matthew that the filtering needs to happen as close to the source 
> as possible, and must be scalable to multiple lcores.

Without modifying the application itself to do its own filtering I suspect 
scalability is always going to be a problem. That being said, there is no 
particular reason why a single rte_ring needs to be used - we could allow one 
ring per NIC queue for instance. The trouble with filtering at the source 
itself is that you put extra load on the IO cores. By using a ring, we put the 
filtering load on extra cores in a secondary process which can be scaled by the 
user without touching the main app.
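
A toy model of the "one ring per NIC queue" idea may help picture the split of work. It is a Python sketch under loud assumptions: `Ring` is a stand-in for an rte_ring (bounded, enqueue fails when full), and `mirror_rx` / `secondary_poll` are invented names, not DPDK functions.

```python
from collections import deque


class Ring:
    """Bounded FIFO standing in for an rte_ring: enqueue fails when full."""

    def __init__(self, size):
        self.q = deque()
        self.size = size
        self.dropped = 0

    def enqueue(self, item):
        if len(self.q) >= self.size:
            self.dropped += 1   # mirror copy lost, forwarding unaffected
            return False
        self.q.append(item)
        return True

    def dequeue_burst(self, n):
        out = []
        while self.q and len(out) < n:
            out.append(self.q.popleft())
        return out


def mirror_rx(rings, queue_id, pkts):
    """IO-core side: mirror each received packet to that queue's ring,
    with no filtering work done here."""
    for p in pkts:
        rings[queue_id].enqueue(p)


def secondary_poll(rings, accept, burst=32):
    """Secondary-process side: drain every ring and filter on its own cores."""
    kept = []
    for r in rings:
        kept.extend(p for p in r.dequeue_burst(burst) if accept(p))
    return kept
```

The trade-off discussed in this thread is visible in the model: the IO cores pay only for the enqueue, but every packet is copied to a ring before most are discarded, and a full ring silently loses mirrored packets.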

> 
> On the positive side, your idea has the advantage that the filter can be any 
> application, and is not limited to BPF. However if the purpose is "tcpdump", 
> we should probably consider BPF, which is the type of filtering offered by 
> tcpdump.

Having this work with any application is one of our primary targets here. The 
app author should not have to worry too much about getting basic debug support.
Even if it doesn't work at 40G small packet rates, you can get a lot of benefit 
from a scheme that provides functional debugging for an app. Obviously, though 
we aim to make this as scalable as possible, which is why we want to allow 
filtering in userspace before sending packets externally to DPDK.

> 
> I would prefer having a BPF library available that the application can use at 
> any point, either at the lowest level (when receiving/transmitting Ethernet 
> packets) or at a higher level (e.g. when working with packets that go into or 
> come out of a tunnel). The BPF library should implement packet length and 
> relevant ancillary data, such as SKF_AD_VLAN_TAG etc. based on metadata in 
> the mbuf.
> 
> Transferring a BPF filter from an outside application could be done by using 
> a simple text format, e.g. the output format of "tcpdump -ddd". This also 
> opens an easy roadmap for Wireshark integration by simply exten

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Matthew Hall
On Wed, Dec 16, 2015 at 11:56:11AM +, Bruce Richardson wrote:
> Having this work with any application is one of our primary targets here. 
> The app author should not have to worry too much about getting basic debug 
> support. Even if it doesn't work at 40G small packet rates, you can get a 
> lot of benefit from a scheme that provides functional debugging for an app. 

I think my issue is that I don't think I buy into this particular set of 
assumptions above.

I don't think a capture mechanism that doesn't work right in the real use 
cases of the apps actually buys us much. If all we care about is quickly 
dumping some frames to a pcap for occasional debugging, I already have some C 
code for that I can donate which is a lot less complicated than the trouble 
being proposed for "basic debug support". Or we could use libpcap's 
equivalent... but it's quite a lot more complicated than the code I have.

If we're going to assign engineers to this it's costing somebody a lot of time 
and money. So I'd prefer to get them focused on something that will always 
work even with high loads, such as real bpfjit support.

Matthew.


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Bruce Richardson
On Wed, Dec 16, 2015 at 01:26:11PM +0100, Morten Brørup wrote:
> Bruce,
> 
> Please note that tcpdump is a stupid name for a packet capture application 
> that supports much more than just TCP.
> 
> I had missed the point about ethdev supporting virtual interfaces, so thank 
> you for pointing that out. That covers my concerns about capturing packets 
> inside tunnels.
> 
> I will gladly admit that you Intel guys are probably much more competent in 
> the field of DPDK performance and scalability than I am. So Matthew and I 
> have been asking you to kindly ensure that your solution scales well at very 
> high packet rates too, and pointing out that filtering before copying is 
> probably cheaper than copying before filtering. You mention that it leads to 
> an important choice about which lcores get to do the work of filtering the 
> packets, so that might be worth some discussion.
> 
> :-)
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 

Thanks for your support.

We may look at having a certain amount of flexibility in the configuration of
the setup, so as to avoid limiting the use of the functionality.

For scalability at very high packet rates, it's something we'll need you guys to
give us pointers on too - what's acceptable or not inside an app, and what
level of scalability is needed. I'd admit that most of our initial thinking in
this area was for debugging apps at less than line rate i.e. for functional
testing. For full line rate introspection, we'll have to see when we get some
working code.

/Bruce

> 
> -----Original Message-----
> From: Bruce Richardson [mailto:bruce.richardson at intel.com] 
> Sent: 16. december 2015 12:56
> To: Morten Brørup
> Cc: Matthew Hall; Kyle Larose; dev at dpdk.org
> Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3
> 
> On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> > Bruce,
> > 
> > This doesn't really sound like tcpdump to me; it sounds like port mirroring.
> 
> It's actually a bit of both, in my opinion: it's designed to allow basic 
> mirroring of traffic on a port to allow that traffic to be sent to a tcpdump 
> destination.
> By going with a more generic approach, we hope to enable more possible use 
> cases than just focusing on TCP.
> 
> 
> > 
> > Your suggestion is limited to physical ports only, and cannot be attached 
> > further inside the application, e.g. for mirroring packets related to a 
> > specific VLAN.
> 
> Yes, the lack of attachment inside the app is a limitation. There are two 
> types of scenarios that could be considered for packet capture:
> * ones where the application can be modified to do its own filtering and 
> capturing.
> * ones where you want a generic capture mechanism which can be used on any 
> application without modification.
> We have chosen to focus more on the second one, as that is where a generic 
> solution for DPDK is likely to lie. For the first case, the application 
> writer himself knows the type of traffic and how best to capture and filter 
> it, so I don't think a generic one-size-fits-all solution is possible. 
> [Though a couple of helper libraries may be of use]
> 
> As for physical ports, the scheme should work for any ethdev - why do you see 
> it only being limited to physical ports? What would you want to see monitored 
> that we are missing?
> 
> > 
> > Furthermore, it doesn't sound like the filtering part scales well. Consider 
> > a fully loaded 40 Gbit/s port. You would need to copy all packets into a 
> > single rte_ring to the attached filtering process, which would then require 
> > its own set of lcores to probably discard most of these packets when 
> > filtering. I agree with Matthew that the filtering needs to happen as close 
> > to the source as possible, and must be scalable to multiple lcores.
> 
> Without modifying the application itself to do its own filtering I suspect 
> scalability is always going to be a problem. That being said, there is no 
> particular reason why a single rte_ring needs to be used - we could allow one 
> ring per NIC queue for instance. The trouble with filtering at the source 
> itself is that you put extra load on the IO cores. By using a ring, we put 
> the filtering load on extra cores in a secondary process which can be scaled 
> by the user without touching the main app.
> 
> > 
> > On the positive side, your idea has the advantage that the filter can be 
> > any application, and is not limited to BPF. However if the purpose is 
> > "tcpdump", we should probably consider BPF, which is the type of filtering 
> > offered by tcpdump.
> 
> Having this work with any application is one of our primary targets h

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Morten Brørup
Great idea, Arnon. Let's look at existing use cases from the real world.

Our company makes network appliances. They are not running GNU/Linux or 
similar, so they do not offer a BASH prompt or any other BSD/Linux like command 
line interface.

Here's a simplified description of how the user interacts with the packet 
capture feature in our appliances:

Our GUI allows you to input a filter, e.g. a MAC address, an IP address or a 
compiled BPF program as a single hexadecimal string (roughly "tcpdump -ddd" 
output), and start capturing. The captured packets can then be downloaded from 
the GUI in pcap format.
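
For readers unfamiliar with it, "tcpdump -ddd" prints the compiled filter as decimal numbers: an instruction count on the first line, then one "code jt jf k" line per instruction. The sketch below parses that text and packs it into a single hexadecimal string. The hex layout used here (8 bytes per instruction, big-endian) is only an assumption for illustration, not the format the appliance GUI actually expects, and the function names are invented.

```python
import struct


def parse_ddd(text):
    """Parse 'tcpdump -ddd' output into a list of (code, jt, jf, k) tuples."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    count = int(lines[0])
    prog = [tuple(int(x) for x in ln.split()) for ln in lines[1:]]
    if len(prog) != count or any(len(ins) != 4 for ins in prog):
        raise ValueError("malformed -ddd output")
    return prog


def to_hex(prog):
    # Assumed encoding: u16 code, u8 jt, u8 jf, u32 k, all big-endian
    return b"".join(struct.pack(">HBBI", *ins) for ins in prog).hex()


def from_hex(hex_string):
    raw = bytes.fromhex(hex_string)
    return [struct.unpack_from(">HBBI", raw, off)
            for off in range(0, len(raw), 8)]


# Output of "tcpdump -ddd ip" (snap length 262144)
DDD_IP = """4
40 0 0 12
21 0 1 2048
6 0 0 262144
6 0 0 0
"""
```

A text format like this needs no libpcap on the receiving side: the filter can be compiled on any workstation and pasted into a management interface.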



The other packet filters our appliance needs, e.g. DHCP, ARP etc., are not 
provided by the user (or by any other external interaction), but are hardcoded 
in C, just like any other part of our firmware.


Med venlig hilsen / kind regards

Morten Brørup
CTO

SmartShare Systems A/S
Tonsbakken 16-18
DK-2740 Skovlunde
Denmark

Office  +45 70 20 00 93
Direct  +45 89 93 50 22
Mobile  +45 25 40 82 12

mb at smartsharesystems.com <mailto:mb at smartsharesystems.com> 
www.smartsharesystems.com <http://www.smartsharesystems.com/> 

From: Arnon Warshavsky [mailto:ar...@qwilt.com] 
Sent: 16. december 2015 12:37
To: Bruce Richardson
Cc: Matthew Hall; dev at dpdk.org; Morten Br?rup
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3



2 points from our experience in saving pcap files from a dpdk 10G fire hose:


1)
Our capture module provides a small "bit-vector" to the code that handles the 
packets. 
Since our packet processing code is already finding out basic stuff about the 
packet traversing it (is it IPv4? v6? is it TCP? is it fragmented? etc.), it 
sets the relevant bits ON as it goes, so that the capture module can later 
quickly decide (by masking against the desired filters) whether a packet needs 
to be captured.

Point is - when a capture layer exposes a slim API that lets it utilize info 
coming from other modules, it's easier and less expensive to handle the fire 
hose.

2)
In many cases we are interested in capturing complete TCP flows, or at least 
the first X packets of them.

In this case, a more expensive filter may be applied only to the SYN packet; 
when it matches, it turns ON a bit in the TCP flow's application context that 
says we want to capture any packet falling under this tuple.

Point is - application-level filters of different costs are applied to 
different packet types, utilizing the mask from the previous bullet.

Such a model would obviously need to be optional in a formal capture layer, 
but when dealing with a fire hose - I find it very useful.
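
The two points above could be sketched roughly as follows. This is only an 
illustration, not the actual capture module: the flag names, struct flow_ctx, 
should_capture() and the match_any() stand-in for the "more expensive filter" 
are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification bits set by the packet-processing code. */
enum pkt_class {
    PKT_IPV4 = 1u << 0,
    PKT_IPV6 = 1u << 1,
    PKT_TCP  = 1u << 2,
    PKT_FRAG = 1u << 3,
    PKT_SYN  = 1u << 4,
};

/* Hypothetical per-flow context kept by the application. */
struct flow_ctx {
    bool     capture;   /* set once a SYN matched the expensive filter */
    uint32_t captured;  /* packets captured so far on this flow */
};

/* Cheap test: every bit required by the filter must be set on the packet. */
static bool mask_match(uint32_t pkt_bits, uint32_t want)
{
    return (pkt_bits & want) == want;
}

/* Stand-in for an expensive filter; a real one would inspect the packet. */
static bool match_any(const void *pkt)
{
    (void)pkt;
    return true;
}

/*
 * Decide whether to capture a packet. The expensive filter runs on SYN
 * packets only; later packets of the flow just test the per-flow bit.
 * At most max_pkts packets are captured per flow.
 */
static bool should_capture(struct flow_ctx *flow, uint32_t pkt_bits,
                           uint32_t want, uint32_t max_pkts,
                           bool (*expensive)(const void *pkt),
                           const void *pkt)
{
    if (!mask_match(pkt_bits, want))
        return false;
    if ((pkt_bits & PKT_SYN) && expensive(pkt))
        flow->capture = true;
    if (!flow->capture || flow->captured >= max_pkts)
        return false;
    flow->captured++;
    return true;
}
```

The cost on the fast path for an uninteresting packet is one AND and one 
compare; the expensive filter is only paid once per flow.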



/Arnon



[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Morten Brørup
Bruce,

This doesn't really sound like tcpdump to me; it sounds like port mirroring.

Your suggestion is limited to physical ports only, and cannot be attached 
further inside the application, e.g. for mirroring packets related to a 
specific VLAN.

Furthermore, it doesn't sound like the filtering part scales well. Consider a 
fully loaded 40 Gbit/s port. You would need to copy all packets into a single 
rte_ring to the attached filtering process, which would then require its own 
set of lcores to probably discard most of these packets when filtering. I agree 
with Matthew that the filtering needs to happen as close to the source as 
possible, and must be scalable to multiple lcores.

On the positive side, your idea has the advantage that the filter can be any 
application, and is not limited to BPF. However if the purpose is "tcpdump", we 
should probably consider BPF, which is the type of filtering offered by tcpdump.

I would prefer having a BPF library available that the application can use at 
any point, either at the lowest level (when receiving/transmitting Ethernet 
packets) or at a higher level (e.g. when working with packets that go into or 
come out of a tunnel). The BPF library should implement packet length and 
relevant ancillary data, such as SKF_AD_VLAN_TAG etc. based on metadata in the 
mbuf.

Transferring a BPF filter from an outside application could be done by using a 
simple text format, e.g. the output format of "tcpdump -ddd". This also opens 
an easy roadmap for Wireshark integration by simply extending extcap to include 
such a BPF filter format.
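
To illustrate how little is needed on the receiving side, here is a minimal 
sketch of a parser for that text format (a decimal instruction count followed 
by one "code jt jf k" line per instruction). The names struct bpf_ins and 
parse_ddd() are made up for this example; the struct merely mirrors the layout 
of a classic BPF instruction.

```c
#include <stdint.h>
#include <stdio.h>

/* Same fields as a classic BPF instruction. */
struct bpf_ins {
    uint16_t code;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

/*
 * Parse "tcpdump -ddd" style text: a decimal instruction count followed by
 * one "code jt jf k" quadruple per instruction. Returns the number of
 * instructions stored in out[], or -1 on malformed or oversized input.
 */
static int parse_ddd(const char *text, struct bpf_ins *out, int max)
{
    unsigned int n, code, jt, jf, k;
    int used;

    if (max <= 0 || sscanf(text, "%u%n", &n, &used) != 1)
        return -1;
    if (n == 0 || n > (unsigned int)max)
        return -1;
    text += used;

    for (unsigned int i = 0; i < n; i++) {
        if (sscanf(text, "%u %u %u %u%n", &code, &jt, &jf, &k, &used) != 4)
            return -1;
        if (code > UINT16_MAX || jt > UINT8_MAX || jf > UINT8_MAX)
            return -1;
        out[i].code = (uint16_t)code;
        out[i].jt = (uint8_t)jt;
        out[i].jf = (uint8_t)jf;
        out[i].k = k;
        text += used;
    }
    return (int)n;
}
```

For example, a four-instruction "IPv4 only" program in this format would be 
"4\n40 0 0 12\n21 0 1 2048\n6 0 0 262144\n6 0 0 0\n"; the snap length in a 
real dump depends on the tcpdump version and options. A real implementation 
would also have to validate the program (jump targets, final RET) before 
running it.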


Lots of negativity above. I very much like the idea of attaching the secondary 
process and going through an rte_ring. This allows the secondary process to 
pass the filtered and captured packets on in any format it likes to any 
destination it likes.


Med venlig hilsen / kind regards
- Morten Brørup

-Original Message-
From: Bruce Richardson [mailto:bruce.richard...@intel.com] 
Sent: 16. december 2015 11:45

Hi,

we are currently doing some investigation and prototyping for this feature.
Our current thinking is the following:
* to allow dynamic control of the filtering, we are thinking of making use of
  the multi-process infrastructure in DPDK. A secondary process can attach to a
  primary at runtime and provide the packet filtering and dumping capability.
* ideally we want to create a generic packet mirroring callback inside the EAL,
  that can be set up to mirror packets going through Rx/Tx on an ethdev.
* using this, packets being received on the port to be monitored are sent via
  an rte_ring (ring ethdev) to the secondary process which takes those packets
  and does any filtering on them. [This would be where BPF could fit into
  things, but it's not something we have looked at yet.]
* initially we plan to have the secondary process then write packets to a pcap
  file using a pcap PMD, but down the road if we get other PMDs, like a KNI PMD
  or a TAP device PMD, those could be used as targets instead.

This implementation we hope should provide enough hooks to enable the standard 
tools to be used for monitoring and capturing packets. We will send out draft 
implementation code for various parts of this as soon as we have it.

Additional feedback welcome, as always. :-)

Regards,
/Bruce




[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Bruce Richardson
On Wed, Dec 16, 2015 at 12:40:43PM +0100, Morten Brørup wrote:
> Bruce,
> 
> This doesn't really sound like tcpdump to me; it sounds like port mirroring.

It's actually a bit of both, in my opinion. It's designed to allow basic
mirroring of traffic on a port, so that the traffic can be sent to a tcpdump
destination. By going with a more generic approach, we hope to enable more
possible use cases than just focusing on TCP.


> 
> Your suggestion is limited to physical ports only, and cannot be attached 
> further inside the application, e.g. for mirroring packets related to a 
> specific VLAN.

Yes, the lack of attachment inside the app is a limitation. There are two types
of scenarios that could be considered for packet capture:
* ones where the application can be modified to do its own filtering and
capturing.
* ones where you want a generic capture mechanism which can be used on any
application without modification.
We have chosen to focus more on the second one, as that is where a generic
solution for DPDK is likely to lie. For the first case, the application writer
himself knows the type of traffic and how best to capture and filter it, so I
don't think a generic one-size-fits-all solution is possible. [Though a couple
of helper libraries may be of use]

As for physical ports, the scheme should work for any ethdev - why do you see
it only being limited to physical ports? What would you want to see monitored
that we are missing?

> 
> Furthermore, it doesn't sound like the filtering part scales well. Consider a 
> fully loaded 40 Gbit/s port. You would need to copy all packets into a single 
> rte_ring to the attached filtering process, which would then require its own 
> set of lcores to probably discard most of these packets when filtering. I 
> agree with Matthew that the filtering needs to happen as close to the source 
> as possible, and must be scalable to multiple lcores.

Without modifying the application itself to do its own filtering I suspect
scalability is always going to be a problem. That being said, there is no
particular reason why a single rte_ring needs to be used - we could allow one
ring per NIC queue for instance. The trouble with filtering at the source itself
is that you put extra load on the IO cores. By using a ring, we put the
filtering load on extra cores in a secondary process which can be scaled by the
user without touching the main app.

> 
> On the positive side, your idea has the advantage that the filter can be any 
> application, and is not limited to BPF. However if the purpose is "tcpdump", 
> we should probably consider BPF, which is the type of filtering offered by 
> tcpdump.

Having this work with any application is one of our primary targets here. The
app author should not have to worry too much about getting basic debug support.
Even if it doesn't work at 40G small packet rates, you can get a lot of benefit
from a scheme that provides functional debugging for an app. Obviously, though,
we aim to make this as scalable as possible, which is why we want to allow
filtering in userspace before sending packets externally to DPDK.

> 
> I would prefer having a BPF library available that the application can use at 
> any point, either at the lowest level (when receiving/transmitting Ethernet 
> packets) or at a higher level (e.g. when working with packets that go into or 
> come out of a tunnel). The BPF library should implement packet length and 
> relevant ancillary data, such as SKF_AD_VLAN_TAG etc. based on metadata in 
> the mbuf.
> 
> Transferring a BPF filter from an outside application could be done by using 
> a simple text format, e.g. the output format of "tcpdump -ddd". This also 
> opens an easy roadmap for Wireshark integration by simply extending extcap to 
> include such a BPF filter format.
> 
> 
> Lots of negativity above. I very much like the idea of attaching the 
> secondary process and going through an rte_ring. This allows the secondary 
> process to pass the filtered and captured packets on in any format it likes 
> to any destination it likes.

Good, so we're not completely off-base here. :-)

/Bruce


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-16 Thread Bruce Richardson
On Mon, Dec 14, 2015 at 05:36:13PM -0500, Matthew Hall wrote:
> On Mon, Dec 14, 2015 at 04:29:41PM -0500, Kyle Larose wrote:
> > I've seen lots of ideas and options tossed around which would solve
> > some or all of the above items, but nobody actually committing to
> > anything. What can we do to actually agree on a solution to go and
> > implement? I'm relatively new to the community, so I don't really know
> > how this stuff works. Do people typically form a working group where
> > they go off and discuss the problem, and then come back to the main
> > community with a proposal? Or do people just submit RFCs independently
> > with their own ideas?
> > 
> > Thanks,
> > Kyle
> 
> I am getting the impression of a misplaced sense of urgency / panic. I don't 
> think anybody came up with a reason why we have to answer all these questions 
> tremendously quickly. It will take some more time, particularly with the 
> holidays, for the developers to finish the last bug fixes on the current 
> release before they have time to discuss 2.3 features.
> 
> When that happens, someone working on DPDK full time will be identified as 
> the leader for the feature, that will lead the effort on PCAP, and help us 
> formulate the plan. Until then, what we really could use at this point is not 
> necessarily more writings and speculation, but an answer on some key tech 
> questions, particularly from some kernel guys:
> 
> 1) How do we get the pcap filter string and/or BPF opcode vector from libpcap 
> / tcpdump / tshark / wireshark, into the DPDK application? There we can 
> compile it using the user-space bpfjit, so we can filter the packets at very 
> high speeds and not end up breaking everything doing a ton of stupid copies 
> when somebody does a capture of one flow on his i40e device or such. libpcap 
> is crappy about this, as it sends it all over syscalls which are always 
> assuming the kernel is on the other end, which is a bad assumption on their 
> part but many decades old and not so easy to fix.
> 
> 2) How do we get the matched packets back out to the extcap or libpcap? From 
> what I saw extcap is tshark / wireshark only, which are 1) GPL licensed in 
> various ways, 2) not as widely used as libpcap. So using only extcap might be 
> kind of crappy.
> 
> 3) For libpcap to work, maybe it will help if some of our kernel guys can 
> help us find out how to "detect" the kernel put a BPF capture filter onto a 
> TUN / TAP interface, and copy that filter to the DPDK app. Then, take any 
> matched packets and write them back onto the TUN / TAP. This would also be 
> super efficient and work with more off-the-shelf tools besides just tshark / 
> wireshark.
> 
> If we don't find the answers for these items I don't think we have a path to 
> a working solution, forgetting about all the nice-to-have points such as UX 
> issues, troubleshooting, debugging, etc.
> 
> Matthew.

Hi,

we are currently doing some investigation and prototyping for this feature.
Our current thinking is the following:
* to allow dynamic control of the filtering, we are thinking of making use of
  the multi-process infrastructure in DPDK. A secondary process can attach to a
  primary at runtime and provide the packet filtering and dumping capability.
* ideally we want to create a generic packet mirroring callback inside the EAL,
  that can be set up to mirror packets going through Rx/Tx on an ethdev.
* using this, packets being received on the port to be monitored are sent via
  an rte_ring (ring ethdev) to the secondary process which takes those packets
  and does any filtering on them. [This would be where BPF could fit into
  things, but it's not something we have looked at yet.]
* initially we plan to have the secondary process then write packets to a pcap
  file using a pcap PMD, but down the road if we get other PMDs, like a KNI PMD
  or a TAP device PMD, those could be used as targets instead.

This implementation we hope should provide enough hooks to enable the standard
tools to be used for monitoring and capturing packets. We will send out draft
implementation code for various parts of this as soon as we have it.
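
To make the mirroring step concrete, here is a self-contained sketch of the
idea, not the prototype code. It stands in for an rte_ring with a tiny
single-producer/single-consumer ring; struct mirror_ring, mirror_enqueue() and
mirror_dequeue() are hypothetical names. The point it illustrates is that the
Rx/Tx side never blocks: if the capture side falls behind, the mirrored copy is
dropped and counted.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MIRROR_RING_SIZE 1024u  /* must be a power of two */

/* Hypothetical single-producer/single-consumer ring of packet pointers. */
struct mirror_ring {
    _Atomic uint32_t head;      /* written by the producer only */
    _Atomic uint32_t tail;      /* written by the consumer only */
    uint64_t dropped;           /* mirror copies lost to a full ring */
    void *slots[MIRROR_RING_SIZE];
};

/* Called from the Rx/Tx path: never blocks, drops the copy if full. */
static int mirror_enqueue(struct mirror_ring *r, void *pkt)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == MIRROR_RING_SIZE) {
        r->dropped++;
        return -1;
    }
    r->slots[head & (MIRROR_RING_SIZE - 1)] = pkt;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Called by the capture side: returns NULL when the ring is empty. */
static void *mirror_dequeue(struct mirror_ring *r)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    void *pkt;

    if (head == tail)
        return NULL;
    pkt = r->slots[tail & (MIRROR_RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return pkt;
}
```

In a real primary/secondary setup the ring and the packet buffers would have to
live in shared memory, and the mirrored packet would need its own reference or
copy so the application can still free the original.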

Additional feedback welcome, as always. :-)

Regards,
/Bruce



[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-14 Thread Thomas Monjalon
2015-12-14 10:45, Aaron Conole:
> After all, it's a networking stack, right?

No, not currently.
DPDK allows building specific lightweight stacks or more complete ones.


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-14 Thread Kyle Larose
On Mon, Dec 14, 2015 at 2:17 PM, Aaron Conole  wrote:

> No need to donate to the cause on this one, I think :) The issues
> surrounding tcpdump are, imo, ones of library/application workflow. HOW
> does the user enable tcpdump-like support? The current option is to
> start up with a pcap PMD configured, capture to a file for a bit, then
> stop. I think the issues being discussed are what other options to give
> the user. Then again, I may have my signals crossed somewhere.
>

I don't think you're crossing signals on giving options to users.
However, I think we're discussing more than just high level UI
options; we're getting into the details internal to any application
involved in capturing packets. While it's great to give options to the
user, we still need to get the captured packets to them. This poses a
few challenges, since we need to do it with low impact (e.g. don't just
write the packet to the HDD in the main packet processing loop), while
not hammering the system with a crazy flood that takes down the kernel
(e.g. by copying everything into some critical task). Both of these have been
discussed in earlier threads/earlier in this thread.

To me, these challenges boil down to:
1) Balancing a nice generic output interface with the most efficient
way to get packets out of the application.
2) Filtering as close to the capture point as possible.

Putting that together with giving options, we need to:
1) Give the users a convenient API to start a capture and provide a filter.
2) Balance a nice generic output interface with the most efficient way
to get packets out of the application.
3) Filter as close to the capture point as possible.

I've seen lots of ideas and options tossed around which would solve
some or all of the above items, but nobody actually committing to
anything. What can we do to actually agree on a solution to go and
implement? I'm relatively new to the community, so I don't really know
how this stuff works. Do people typically form a working group where
they go off and discuss the problem, and then come back to the main
community with a proposal? Or do people just submit RFCs independently
with their own ideas?

Thanks,

Kyle


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-14 Thread Aaron Conole
Matthew Hall  writes:
>> The pcap file format contains a header in front of each packet, which is 
>> extremely simple. But it has a timestamp (which uses 32 bit for tv_sec and 
>> tv_usec in files), so it needs to be considered how to handle this 
>> efficiently.
>
> I already wrote some C code for generating the original pcap format files a 
> while ago which I think could be donated. For the timestamps to work at 
> highest efficiency we'd need to run an rte_timer every X microseconds that 
> updates a global volatile copy of tv_sec and tv_usec.
>
> Or make some code that calculates the offset of rte_rdtsc from 01 January 
> 1970 00:00:00 UTC and uses TSC value to generate the right tv_sec and tv_usec 
> would also work fine.

Why not just use libpcap to write out pcap files? I bet it does a better
job than any of us will ;) It's BSD licensed, so there should be no
issues with linking against it (DPDK currently does for the pcap PMD), and
it supports both pcap and pcap-ng (although -ng support may not be 100%,
I expect it will get better).

No need to donate to the cause on this one, I think :) The issues
surrounding tcpdump are, imo, ones of library/application workflow. HOW
does the user enable tcpdump-like support? The current option is to
start up with a pcap PMD configured, capture to a file for a bit, then
stop. I think the issues being discussed are what other options to give
the user. Then again, I may have my signals crossed somewhere.

-Aaron

> Matthew.


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-14 Thread Matthew Hall
FYI your last name comes in as a corrupt character for me. You might have to 
think about converting it from ISO 8859-1 / 8859-15 to UTF-8.

On Mon, Dec 14, 2015 at 10:57:10AM +0100, Morten B wrote:
> Check out the new "extcap" feature of Wireshark. It uses named pipes for the 
> packets, already mentioned by Stephen Hemminger.

I looked at it a bit. I wasn't 100% clear if there is a way to pass down the 
BPF expression for compilation and usage inside the DPDK application.

> Tcpdump is an open source application, so it should be possible to define an 
> efficient interface between DPDK and tcpdump, and implement it in both DPDK 
> and tcpdump. The same goes for libpcap.

Easier said than done. A whole ton of libpcap assumes it's talking to a very 
specific kernel interface, and the code is quite complicated.

> It possibly also has a secondary feature: passing a BPF program 
> from tcpdump/libpcap to DPDK, so packets can be filtered in DPDK and don't 
> need to be passed on to tcpdump/libpcap.

If we can figure out how to get this feature to work in extcap, I think that 
will be the winning solution by far.

> [A]dd a BPF library (librte_bpf) to DPDK, preferably with a compiler. The 
> application initially calls the library's BPF compiler function once with 
> the BPF program to compile it, and in the fast path the application calls a 
> library function that takes an mbuf and the compiled BPF program and returns 
> an integer value indicating how many bytes of the packet should be mirrored 
> by the capturing application. +1 to Matthew Hall for taking this direction!

Yes, performance wise I think this is the only way that will really work 100% 
of the time. Otherwise I think we end up in the very bad situation where the 
guy who tries to make a capture of a single flow for debugging on i40e ends up 
crashing his system or dropping all his traffic when the capture system 
unhelpfully redirects a storm of unfiltered traffic outside of DPDK to KNI or 
some pipe devices or another place it does not belong.
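
To show what the fast-path call could look like, here is a minimal sketch of a 
classic BPF interpreter that returns how many bytes of a packet to mirror. It 
is not a proposed librte_bpf API: the names bpf_run() and capture_len() are 
hypothetical, it works on a flat byte buffer rather than an mbuf, and it only 
implements a small subset of opcodes (absolute loads, conditional jumps 
against a constant, and RET). Anything else is rejected.

```c
#include <stddef.h>
#include <stdint.h>

struct bpf_ins {
    uint16_t code;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

/* Classic BPF opcode values for the subset handled here. */
enum {
    OP_LD_W_ABS = 0x20, OP_LD_H_ABS = 0x28, OP_LD_B_ABS = 0x30,
    OP_JEQ_K = 0x15, OP_JGT_K = 0x25, OP_JGE_K = 0x35, OP_JSET_K = 0x45,
    OP_RET_K = 0x06, OP_RET_A = 0x16,
};

/* Run the program; returns the filter's result, 0 meaning "no match". */
static uint32_t bpf_run(const struct bpf_ins *prog, size_t n,
                        const uint8_t *pkt, uint32_t len)
{
    uint32_t a = 0;  /* accumulator */

    for (size_t pc = 0; pc < n; pc++) {
        const struct bpf_ins *in = &prog[pc];
        uint32_t k = in->k;
        int taken;

        switch (in->code) {
        case OP_LD_W_ABS:  /* 32-bit big-endian load, bounds checked */
            if (k > len || len - k < 4)
                return 0;
            a = ((uint32_t)pkt[k] << 24) | ((uint32_t)pkt[k + 1] << 16) |
                ((uint32_t)pkt[k + 2] << 8) | pkt[k + 3];
            continue;
        case OP_LD_H_ABS:  /* 16-bit big-endian load */
            if (k > len || len - k < 2)
                return 0;
            a = ((uint32_t)pkt[k] << 8) | pkt[k + 1];
            continue;
        case OP_LD_B_ABS:
            if (k >= len)
                return 0;
            a = pkt[k];
            continue;
        case OP_JEQ_K:  taken = (a == k);       break;
        case OP_JGT_K:  taken = (a > k);        break;
        case OP_JGE_K:  taken = (a >= k);       break;
        case OP_JSET_K: taken = (a & k) != 0;   break;
        case OP_RET_K:  return k;
        case OP_RET_A:  return a;
        default:        return 0;  /* unsupported opcode: reject */
        }
        /* Jump offsets are relative to the next instruction. */
        pc += taken ? in->jt : in->jf;
    }
    return 0;  /* ran off the end without a RET */
}

/* Bytes of this packet to mirror: the filter result capped at len. */
static uint32_t capture_len(const struct bpf_ins *prog, size_t n,
                            const uint8_t *pkt, uint32_t len)
{
    uint32_t want = bpf_run(prog, n, pkt, len);

    return want < len ? want : len;
}
```

An interpreter like this is the slow option; a JIT would be the fast one. 
Either way the interface stays the same: packet in, byte count out.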

There is one complexity though... the list of BPF filters should probably be a 
linked list, where they get added and removed, or you can't do > 1 filter at a 
time. I know how to code some of this stuff but I only work on DPDK in my 
spare time so I don't have the cycles to do all of the work.
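
A rough sketch of such a list, with hypothetical names throughout 
(filter_node, filter_add(), filter_remove(), filters_run()). Each filter is 
modelled as a function returning the number of bytes it wants from a packet, 
and the packet is captured with the largest length any active filter asks 
for. The sketch ignores locking; adding or removing filters while the fast 
path is walking the list would need extra care.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* A filter returns how many bytes of the packet it wants (0 = no match). */
typedef uint32_t (*filter_fn)(const uint8_t *pkt, uint32_t len, void *arg);

struct filter_node {
    struct filter_node *next;
    filter_fn fn;
    void *arg;
};

/* Example filter: always asks for the number of bytes stored in *arg. */
static uint32_t snap_filter(const uint8_t *pkt, uint32_t len, void *arg)
{
    (void)pkt;
    (void)len;
    return *(const uint32_t *)arg;
}

/* Add a filter at the head of the list; returns a handle or NULL. */
static struct filter_node *filter_add(struct filter_node **head,
                                      filter_fn fn, void *arg)
{
    struct filter_node *n = malloc(sizeof(*n));

    if (n == NULL)
        return NULL;
    n->fn = fn;
    n->arg = arg;
    n->next = *head;
    *head = n;
    return n;
}

/* Unlink and free one filter; does nothing if it is not in the list. */
static void filter_remove(struct filter_node **head,
                          struct filter_node *victim)
{
    for (struct filter_node **pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;
            free(victim);
            return;
        }
    }
}

/* Run every filter; capture the largest requested length, capped at len. */
static uint32_t filters_run(const struct filter_node *head,
                            const uint8_t *pkt, uint32_t len)
{
    uint32_t best = 0;

    for (; head != NULL; head = head->next) {
        uint32_t want = head->fn(pkt, len, head->arg);

        if (want > best)
            best = want;
    }
    return best < len ? best : len;
}
```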

> The pcap file format contains a header in front of each packet, which is 
> extremely simple. But it has a timestamp (which uses 32 bit for tv_sec and 
> tv_usec in files), so it needs to be considered how to handle this 
> efficiently.

I already wrote some C code for generating the original pcap format files a 
while ago which I think could be donated. For the timestamps to work at 
highest efficiency we'd need to run an rte_timer every X microseconds that 
updates a global volatile copy of tv_sec and tv_usec.
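
For reference, the original pcap file layout is small enough to sketch here. 
This is not the code mentioned above, just an illustration with made-up names 
(capfile_hdr, capfile_rec, capfile_write_hdr(), capfile_write_pkt()). It 
writes in host byte order and assumes Ethernet frames (link type 1); readers 
detect the byte order from the magic number.

```c
#include <stdint.h>
#include <stdio.h>

/* 24-byte header at the start of a pcap file. */
struct capfile_hdr {
    uint32_t magic;      /* 0xa1b2c3d4: microsecond timestamps */
    uint16_t ver_major;  /* 2 */
    uint16_t ver_minor;  /* 4 */
    int32_t  thiszone;   /* GMT offset, normally 0 */
    uint32_t sigfigs;    /* timestamp accuracy, normally 0 */
    uint32_t snaplen;    /* max bytes stored per packet */
    uint32_t linktype;   /* 1 = Ethernet */
};

/* 16-byte header in front of each packet. */
struct capfile_rec {
    uint32_t ts_sec;
    uint32_t ts_usec;
    uint32_t incl_len;   /* bytes stored in the file */
    uint32_t orig_len;   /* bytes on the wire */
};

/* Write the file header; returns 0 on success, -1 on error. */
static int capfile_write_hdr(FILE *f, uint32_t snaplen)
{
    const struct capfile_hdr h = { 0xa1b2c3d4u, 2, 4, 0, 0, snaplen, 1 };

    return fwrite(&h, sizeof(h), 1, f) == 1 ? 0 : -1;
}

/* Write one packet record; returns 0 on success, -1 on error. */
static int capfile_write_pkt(FILE *f, uint32_t sec, uint32_t usec,
                             const void *data, uint32_t caplen,
                             uint32_t wirelen)
{
    const struct capfile_rec r = { sec, usec, caplen, wirelen };

    if (fwrite(&r, sizeof(r), 1, f) != 1)
        return -1;
    return fwrite(data, 1, caplen, f) == caplen ? 0 : -1;
}
```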

Or make some code that calculates the offset of rte_rdtsc from 01 January 1970 
00:00:00 UTC and uses TSC value to generate the right tv_sec and tv_usec would 
also work fine.
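
The TSC variant could look something like the sketch below. The struct and 
function names are hypothetical; in a DPDK application the inputs would 
presumably come from rte_rdtsc() and rte_get_tsc_hz(), with the wall-clock 
base sampled once at startup. It assumes a constant-rate TSC and does not 
correct for drift against the system clock.

```c
#include <stdint.h>

/* Calibration data captured once, e.g. at startup. */
struct tsc_clock {
    uint64_t tsc_hz;     /* TSC ticks per second */
    uint64_t base_tsc;   /* TSC value sampled at calibration */
    uint64_t base_sec;   /* Unix time (seconds) at calibration */
    uint32_t base_usec;  /* microsecond part at calibration, < 1000000 */
};

/* Convert a later TSC reading into pcap-style 32-bit sec/usec values. */
static void tsc_to_timeval(const struct tsc_clock *c, uint64_t tsc,
                           uint32_t *sec, uint32_t *usec)
{
    uint64_t delta = tsc - c->base_tsc;
    uint64_t whole = delta / c->tsc_hz;  /* full seconds elapsed */
    uint64_t frac = delta % c->tsc_hz;   /* leftover ticks, < tsc_hz */
    /* frac * 1000000 cannot overflow 64 bits for realistic TSC rates. */
    uint64_t us = c->base_usec + frac * 1000000u / c->tsc_hz;

    *sec = (uint32_t)(c->base_sec + whole + us / 1000000u);
    *usec = (uint32_t)(us % 1000000u);
}
```

The two divisions per packet could be avoided by precomputing a fixed-point 
multiplier, at the price of slightly more complicated calibration.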

Matthew.


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-14 Thread Stephen Hemminger
On Mon, 14 Dec 2015 13:29:31 -0500
Matthew Hall  wrote:

> FYI your last name comes in as a corrupt character for me. You might have to 
> think about converting it from ISO 8859-1 / 8859-15 to UTF-8.
> 
> On Mon, Dec 14, 2015 at 10:57:10AM +0100, Morten B wrote:
> > Check out the new "extcap" feature of Wireshark. It uses named pipes for 
> > the packets, already mentioned by Stephen Hemminger.
> 
> I looked at it a bit. I wasn't 100% clear if there is a way to pass down the 
> BPF expression for compilation and usage inside the DPDK application.
> 
> > Tcpdump is an open source application, so it should be possible to define 
> > an efficient interface between DPDK and tcpdump, and implement it in both 
> > DPDK and tcpdump. The same goes for libpcap.
> 
> Easier said than done. A whole ton of libpcap assumes it's talking to a very 
> specific kernel interface, and the code is quite complicated.
> 
> > It possibly also has a secondary feature: passing a BPF program 
> > from tcpdump/libpcap to DPDK, so packets can be filtered in DPDK and don't 
> > need to be passed on to tcpdump/libpcap.
> 
> If we can figure out how to get this feature to work in extcap, I think that 
> will be the winning solution by far.
> 
> > [A]dd a BPF library (librte_bpf) to DPDK, preferably with a compiler. The 
> > application initially calls the library's BPF compiler function once with 
> > the BPF program to compile it, and in the fast path the application calls a 
> > library function that takes an mbuf and the compiled BPF program and 
> > returns an integer value indicating how many bytes of the packet should be 
> > mirrored by the capturing application. +1 to Matthew Hall for taking this 
> > direction!

There are already several BPF libraries available. I would prefer DPDK not
start copying existing code.


[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-14 Thread Aaron Conole
Morten Brørup  writes:
> I noticed a discussion about support for tcpdump in DPDK 2.3.
>
>  
>
> Please consider which scenarios you want to support:

Morten,

Thanks for your input here. I think there's a different way of
approaching this: "debuggability" (sorry, it's not grammatical).

The end goal of having tcpdump is not just for another feature checklist
that folks can just say "okay, welp we got that too!" When something is
going wrong with communications, being able to fire up tcpdump without
disturbing anything else is hugely important to isolating issues. I
think that's an important scenario, and may be enabled by one or more of
the features you've listed.

There are other scenarios as well, that you hinted at - using existing
applications built around libpcap. That is important to enable as well,
but I think the biggest hurdle to getting anyone to use a DPDK enabled
application will always be: "How much work do I have to do when
something goes wrong?"

There are certainly things that should belong in an application. But I
think easy enabling of a tcpdump capable mechanism is DPDK's
responsibility. After all, it's a networking stack, right?

Whichever combination of features is used, we shouldn't really
discourage them, I think. Any way the user can debug something using
familiar workflows and tools is a way that dpdk-dev doesn't need to get
involved.

Just my $.02, anyway.

Thanks,
-Aaron