Re: [vpp-dev] VPP Graph Optimization

2018-01-26 Thread David Bainbridge
Very helpful and interesting. Thank you.


Re: [vpp-dev] VPP Graph Optimization

2018-01-26 Thread Dave Barach (dbarach)
Dear David,

 

A bit of history. We worked on vpp for a decade before making any serious 
effort to multi-thread it. The first scheme that I tried was to break up the 
graph into reconfigurable pipeline stages. Effective partitioning of the graph 
is highly workload-dependent, and it can change in a heartbeat. The resulting
system runs at the speed of the slowest pipeline stage.
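
(To put rough numbers on that, with invented figures: if a three-stage
pipeline costs 40, 80, and 60 clocks/pkt on three cores, steady-state
throughput is pinned at one packet per 80 clocks by the middle stage, while
the other two cores sit partly idle. Rebalancing means re-partitioning the
graph every time the workload mix shifts.)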

 

In terms of easily measured inter-thread handoff cost, it’s not awful: 2-3
clocks/pkt. Handing vectors of packets between threads can cause a festival of
cache coherence traffic, and it can easily undo the positive effects of DDIO
(packet data DMA into the cache hierarchy).
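
To make the coherence point concrete, here’s a minimal single-producer /
single-consumer handoff ring: a toy sketch of my own, not VPP’s vlib
frame-queue code. The shared head, tail, and slot cache lines it bounces
between the producer and consumer cores are exactly where that traffic comes
from:

#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 256           /* power of two */

typedef struct
{
  void *slots[RING_SIZE];
  _Atomic size_t head;          /* written by producer, read by consumer */
  _Atomic size_t tail;          /* written by consumer, read by producer */
} handoff_ring_t;

static int
ring_enqueue (handoff_ring_t *r, void *pkt)
{
  size_t h = atomic_load_explicit (&r->head, memory_order_relaxed);
  size_t t = atomic_load_explicit (&r->tail, memory_order_acquire);
  if (h - t == RING_SIZE)
    return -1;                  /* full: the consumer is the bottleneck */
  r->slots[h & (RING_SIZE - 1)] = pkt;
  atomic_store_explicit (&r->head, h + 1, memory_order_release);
  return 0;
}

static void *
ring_dequeue (handoff_ring_t *r)
{
  size_t t = atomic_load_explicit (&r->tail, memory_order_relaxed);
  size_t h = atomic_load_explicit (&r->head, memory_order_acquire);
  if (t == h)
    return NULL;                /* empty */
  void *pkt = r->slots[t & (RING_SIZE - 1)];
  atomic_store_explicit (&r->tail, t + 1, memory_order_release);
  return pkt;
}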

 

We actually use the scheme you describe in a very fine-grained way: dual and 
quad loop graph dispatch functions process 2 or 4 packets at the same time. 
Until we run out of registers, a superscalar CPU can “do the same thing to 2 or 
4 packets at the same time” pretty effectively. Including memory hierarchy 
stalls, vpp averages more than two instructions retired per clock cycle.
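
As a sketch of the shape of a dual loop (generic names; process_one and
prefetch are hypothetical stand-ins, not VPP’s actual node API):

typedef struct pkt pkt_t;

extern void process_one (pkt_t *p);    /* hypothetical per-packet work */
extern void prefetch (const void *a);  /* stand-in for __builtin_prefetch */

static void
dispatch_vector (pkt_t **pkts, int n_pkts)
{
  int i = 0;
  while (i + 1 < n_pkts)        /* dual loop: 2 packets per iteration */
    {
      if (i + 3 < n_pkts)       /* prefetch two packets ahead */
        {
          prefetch (pkts[i + 2]);
          prefetch (pkts[i + 3]);
        }
      process_one (pkts[i + 0]);  /* no data dependence between these two */
      process_one (pkts[i + 1]);  /* calls, so a superscalar core overlaps them */
      i += 2;
    }
  while (i < n_pkts)            /* single-packet cleanup loop */
    process_one (pkts[i++]);
}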

 

At the graph node level, I can’t see how to leverage this technique. Presenting 
[identical] vectors to 2 (or more) nodes running on multiple threads would mean 
(1) the parallelized subgraph would run at the speed of the slowest node, (2) 
you’d pay the handoff costs already discussed above, and (3) you’d need an 
expensive algorithm to make sure that all vector replicas were finished before 
reentering sequential processing. (4) None of the graph nodes we’ve ever 
constructed are free of ordering constraints: every node alters packet state in 
a meaningful way, or it wouldn’t be worth having.
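
For point (3), the per-vector join would have to look something like this (a
hypothetical sketch, not anything in the VPP tree):

#include <pthread.h>

/* Hypothetical per-vector join: every replica must check in before any
 * thread may continue, so each vector pays a full barrier (atomics plus
 * possible futex/condvar wakeups) on top of the handoff costs above. */
static pthread_barrier_t replica_barrier;  /* init elsewhere with n_replicas */

static void
run_replica (void (*node_fn) (void *vec), void *vec_copy)
{
  node_fn (vec_copy);                       /* parallel subgraph work */
  pthread_barrier_wait (&replica_barrier);  /* wait for every replica */
  /* only now is it safe to re-enter sequential processing */
}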

 

We’ve had considerable success with flow-hashing across a set of identical 
graph replicas [worker threads], even when available hardware RSS hashing is 
not useful [think about NATted UDP traffic]. 
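
The worker selection itself can be as simple as hashing the 5-tuple in
software. A toy sketch, illustrative only, not VPP’s actual handoff hash:

#include <stdint.h>

/* Toy 5-tuple flow hash: every packet of a flow lands on the same worker,
 * preserving per-flow ordering with no locking, even when NIC RSS cannot
 * see the flow (e.g. NATted UDP). */
typedef struct
{
  uint32_t src_ip, dst_ip;
  uint16_t src_port, dst_port;
  uint8_t proto;
} flow_key_t;

static uint32_t
fnv1a_step (uint32_t h, uint32_t v)
{
  for (int i = 0; i < 4; i++)   /* fold in one 32-bit value, byte by byte */
    h = (h ^ ((v >> (8 * i)) & 0xff)) * 16777619u;
  return h;
}

static uint32_t
flow_hash (const flow_key_t *k)
{
  uint32_t h = 2166136261u;     /* FNV-1a offset basis */
  h = fnv1a_step (h, k->src_ip);
  h = fnv1a_step (h, k->dst_ip);
  h = fnv1a_step (h, ((uint32_t) k->src_port << 16) | k->dst_port);
  h = fnv1a_step (h, k->proto);
  return h;
}

static uint32_t
pick_worker (const flow_key_t *k, uint32_t n_workers)
{
  return flow_hash (k) % n_workers;  /* stable worker per flow */
}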

 

Hope this is of some interest.

 

Thanks… Dave

 


[vpp-dev] VPP Graph Optimization

2018-01-26 Thread David Bainbridge
I have just started to read up on VPP/FD.io, and I have a question about
graph optimization. I was wondering whether (as I suspect) this has already
been thought about and either planned or decided against.


The documentation I found on VPP essentially says that VPP uses batch
processing: all packets in a vector are processed by one step before
proceeding to the next step. The claim is that this provides better overall
throughput because of instruction caching.
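
My mental model of that dispatch, as a rough C sketch (my own sketch, not
VPP source):

/* Each node runs over the whole vector before the next node starts, so a
 * node's instructions stay hot in the i-cache for all n_pkts packets. */
typedef struct pkt pkt_t;

typedef struct
{
  const char *name;
  void (*fn) (pkt_t **pkts, int n_pkts);  /* processes the whole vector */
} node_t;

static void
dispatch_graph (node_t *nodes, int n_nodes, pkt_t **vec, int n_pkts)
{
  for (int n = 0; n < n_nodes; n++)  /* one step at a time ... */
    nodes[n].fn (vec, n_pkts);       /* ... over the entire vector */
}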


I was wondering whether optimizing the graph to identify where concurrency
can be leveraged has been considered, as well as whether the vector could be
processed by two steps at an offset. If this is possible, then steps could be
pinned to cores, and perhaps both concurrency and instruction caching could be
leveraged.


For example, assume the following graph:


[figure: an example graph in which step A feeds steps B and C in parallel;
B and C both feed step D, which is followed by step E]


In this graph, steps B and C can be done concurrently as they don't "modify"
the vector. Steps D and E can't be done concurrently, but as they don't
require looking backward or forward in the vector, they can be done at an
offset.


What I am suggesting is that, if there are enough cores, steps could be
pinned to cores to achieve the benefits of instruction caching, and after
step A is complete, steps B and C could be done concurrently. After B and C
are complete, D can be started, and as D completes processing on a packet, it
can then be processed by E (i.e., the entire vector does not need to be
processed by D before processing by E is started).
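
In single-core terms, what I have in mind for D and E reduces to the loop
below (step_D and step_E are hypothetical names; this is just a sketch).
Spreading it across two cores is where the coordination cost mentioned next
comes in:

typedef struct pkt pkt_t;

extern void step_D (pkt_t *p);  /* hypothetical per-packet steps */
extern void step_E (pkt_t *p);

static void
process_offset (pkt_t **vec, int n_pkts)
{
  for (int i = 0; i < n_pkts; i++)
    {
      step_D (vec[i]);  /* as soon as D completes packet i ... */
      step_E (vec[i]);  /* ... E may begin on it */
    }
}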


I don't dispute that this increases complexity and also introduces
coordination costs that don't exist today. To be fair, offset processing
could be viewed as splitting the original large vector into smaller vectors
and processing the smaller vectors from start to finish (almost a dynamic
optimization based on dynamic vector resizing).

Just curious to hear others' thoughts and whether some of this has been
thought through or experimented with. As I said, I'm just thinking off the
cuff and wondering; this is not fully thought through.


With respect,

/david
___
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev