Re: [vpp-dev] vlib_validate_buffer_enqueue

2017-11-13 Thread Justin Iurman
Dave,

Thanks for your (deep) explanation. Same for Chris, thank you :-)

Justin

___
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev

Re: [vpp-dev] vlib_validate_buffer_enqueue

2017-11-13 Thread Dave Barach (dbarach)
Dear Justin,

Quad-loops are generally not effective for table-lookup-intensive tasks. At a 
certain point, gcc runs out of registers and starts putting hot variables onto 
the stack. I've converted a number of dual loops into quad loops, only to 
discover that they're no faster than the dual loop version.

Rather than having the sample plugin propagate a bunch of "fetch me a rock" 
coding work, I went with a dual-single loop. When doing new development, I shut 
off the dual loop, make the single loop work, then build the dual (or quad) 
loop. 

With experience, building a dual (or quad) loop becomes a mechanical exercise 
easily done during a boring meeting...

In viable quad-loop use-cases, there's no performance gain in also providing a 
dual loop. The dual-loop code will run at most one time; there's no chance of 
fixed overhead amortization.

Thanks… Dave
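
The point about the dual loop running at most once can be illustrated with a small stand-alone sketch. This is plain C, not VPP code: `process_one`, `loop_stats_t`, `process_frame`, and the `use_dual` flag are hypothetical names made up for the illustration, and the per-packet work is a trivial stand-in for real node processing. It only counts how many times each loop body runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet work; a stand-in for a real node's processing. */
static inline uint32_t process_one(uint32_t pkt) { return pkt + 1; }

/* Hypothetical counters: how many times each loop body executed. */
typedef struct {
  size_t quad_iters;
  size_t dual_iters;
  size_t single_iters;
} loop_stats_t;

/* Quad loop, optional dual loop, then single loop over n packets. */
loop_stats_t process_frame(const uint32_t *in, uint32_t *out, size_t n,
                           int use_dual)
{
  loop_stats_t s = {0, 0, 0};
  size_t i = 0;

  assert(in != NULL && out != NULL);

  /* Quad loop: four independent packets per iteration. */
  while (n - i >= 4) {
    out[i + 0] = process_one(in[i + 0]);
    out[i + 1] = process_one(in[i + 1]);
    out[i + 2] = process_one(in[i + 2]);
    out[i + 3] = process_one(in[i + 3]);
    i += 4;
    s.quad_iters++;
  }

  /* Dual loop: at most 3 packets remain here, so this runs 0 or 1 times. */
  while (use_dual && n - i >= 2) {
    out[i + 0] = process_one(in[i + 0]);
    out[i + 1] = process_one(in[i + 1]);
    i += 2;
    s.dual_iters++;
  }

  /* Single loop: mops up whatever is left. */
  while (i < n) {
    out[i] = process_one(in[i]);
    i += 1;
    s.single_iters++;
  }

  return s;
}
```

With 7 packets and the dual loop enabled, each loop body runs once; with it disabled, the single loop runs three times instead. Either way the dual loop never gets a second iteration over which to amortize its fixed overhead.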


Re: [vpp-dev] vlib_validate_buffer_enqueue

2017-11-13 Thread Luke, Chris
The x4 variant was introduced after the sample plugin was written, and 
nobody went back to update it. Generally speaking, though, the four-wide stride 
is only beneficial in some cases. The reasoning is a bit arcane and rests on 
how likely it is that the CPU cache can be kept primed, and similar 
considerations. As best I can tell, it comes down to judgement based on the 
empirical experience of a handful of the wizened. By extension, the gains from 
an x8 version are likely marginal.

Chris
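
To make the "keep the CPU cache primed" idea concrete, here is a minimal stand-alone sketch of a dual loop that prefetches the packets for the next iteration while working on the current pair. It is plain C and does not use any VPP API: `pkt_t`, `lookup_dual`, and the modulo table lookup are hypothetical stand-ins, and `__builtin_prefetch` is the GCC/Clang builtin, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet record; not a VPP type. */
typedef struct {
  uint32_t key;
  uint32_t next_index;
} pkt_t;

/* Look up each packet's key in a small table, two packets per iteration,
   prefetching the two packets that the following iteration will touch. */
void lookup_dual(pkt_t **pkts, size_t n, const uint32_t *table,
                 uint32_t table_size)
{
  size_t n_left = n;

  assert(pkts != NULL && table != NULL && table_size > 0);

  /* Dual loop: needs 4 packets left so that pkts[2] and pkts[3] exist. */
  while (n_left >= 4) {
    pkt_t *p0, *p1;

    /* Hint that the next pair will be written soon (rw = 1, locality = 3). */
    __builtin_prefetch(pkts[2], 1, 3);
    __builtin_prefetch(pkts[3], 1, 3);

    p0 = pkts[0];
    p1 = pkts[1];
    p0->next_index = table[p0->key % table_size];
    p1->next_index = table[p1->key % table_size];

    pkts += 2;
    n_left -= 2;
  }

  /* Single loop: remaining packets, no prefetch. */
  while (n_left > 0) {
    pkt_t *p0 = pkts[0];
    p0->next_index = table[p0->key % table_size];
    pkts += 1;
    n_left -= 1;
  }
}
```

Whether a wider stride helps depends on whether the prefetched data arrives before it is needed and on how many live variables the loop body keeps, which is why the choice tends to be settled by measurement rather than by rule.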



[vpp-dev] vlib_validate_buffer_enqueue

2017-11-13 Thread Justin Iurman
Hey guys,

In buffer_node.h, there are the following macros:
- vlib_validate_buffer_enqueue_x1
- vlib_validate_buffer_enqueue_x2
- vlib_validate_buffer_enqueue_x4

In a node, I was just wondering what the idea behind them is. Is it for 
speed? I mean, you're obviously faster if you process 4 packets 
horizontally than one after the other. Why then, in the sample plugin, is the 
"x4" version not used? A "perfect" plugin would use each of them to cover each 
case, right? Also, why not have an "x8" (or more) version? I guess it's 
either because of a performance issue or to stop at a specific ceiling.

Thanks!

Justin