On Wed, Aug 06, 2014 at 04:59:59PM +0000, Richardson, Bruce wrote:
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > Sent: Wednesday, August 06, 2014 5:19 AM
> > To: Ananyev, Konstantin
> > Cc: dev at dpdk.org; Thomas Monjalon; Richardson, Bruce
> > Subject: Re: [PATCH] acl: If build does not support sse4.2, emulate missing
> > instructions with C code
> > 
> > On Wed, Aug 06, 2014 at 11:39:22AM +0000, Ananyev, Konstantin wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > Sent: Tuesday, August 05, 2014 7:21 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev at dpdk.org; Thomas Monjalon; Richardson, Bruce
> > > > Subject: Re: [PATCH] acl: If build does not support sse4.2, emulate 
> > > > missing
> > instructions with C code
> > > >
> > > > On Tue, Aug 05, 2014 at 03:26:27PM +0000, Ananyev, Konstantin wrote:
> > > > > Hi Neil,
> > > > >
> > > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > Sent: Monday, August 04, 2014 4:36 PM
> > > > > > To: dev at dpdk.org
> > > > > > Cc: Neil Horman; Thomas Monjalon; Ananyev, Konstantin; Richardson,
> > Bruce
> > > > > > Subject: [PATCH] acl: If build does not support sse4.2, emulate 
> > > > > > missing
> > instructions with C code
> > > > > >
> > > > > > The ACL library makes extensive use of some SSE4.2 instructions, 
> > > > > > which
> > means the
> > > > > > default build can't compile this library.  Work around the problem 
> > > > > > by
> > testing
> > > > > > the __SSE42__ definition in the acl_vects.h file and defining the 
> > > > > > macros
> > there
> > > > > > as intrinsics or c-level equivalants.  Note this is a minimal 
> > > > > > patch, adjusting
> > > > > > only the definitions that are currently used in the ACL library.
> > > > > >
> > > > >
> > > > > My comments about actual implementations of c-level equivalents below.
> > > > > None of them look correct to me.
> > > > > Of course it could be fixed.
> > > > > Though I am not sure that is a right way to proceed:
> > > > > At first I really doubt that these equivalents will provide similar
> > performance.
> > > > > As you probably note - we do have a scalar version of  
> > > > > rte_acl_classify():
> > rte_acl_classify_scalar().
> > > > > So I think it might be faster than vector one with 'emulated' 
> > > > > instrincts.
> > > > > Unfortunately it is all mixed right now into one file and 'scalar' 
> > > > > version
> > could use few sse4 instrincts through resolve_priority().
> > > > > Another thing - we consider to add another version of 
> > > > > rte_acl_classify()
> > that will use avx2 instrincts.
> > > > > If we go the way you suggest - I am afraid will soon have to provide 
> > > > > scalar
> > equivalents for several AVX2 instrincts too.
> > > > > So in summary that way (providing our own scalar equivalents of SIMD
> > instrincts) seems to me slow, hard to maintain and error
> > > > prone.
> > > > >
> > > > > What porbably can be done instead: rework acl_run.c a bit, so it 
> > > > > provide:
> > > > > rte_acl_classify_scalar() - could be build and used on all systems.
> > > > > rte_acl_classify_sse() - could be build and used only on systems with 
> > > > > sse4.2
> > and upper, return ENOTSUP on lower arch.
> > > > > In future: rte_acl_classify_avx2 - could be build and used only on 
> > > > > systems
> > with avx2 and upper, return ENOTSUP on lower arch.
> > > > >
> > > > > I am looking at rte_acl right now anyway.
> > > > > So will try to come up with something workable.
> > > > >
> > > > So, this is exactly the opposite of what Bruce and I just spent several 
> > > > days
> > and
> > > > a huge email thread that you clearly are aware of discussing run time 
> > > > versus
> > > > compile time selection of paths.  At this point I'm done ping ponging
> > between
> > > > your opposing viewpoints.  If you want to implement something that does
> > run time
> > > > checking, I'm fine with it, but I'm not going back and forth until you 
> > > > two
> > come
> > > > to an agreement on this.
> > >
> > > Right now, I am not talking about 'run time vs compile time selection'.
> > But you are talking about exactly that, allbeit implicitly.  To implement 
> > what
> > you recommend above (that being multiple functional paths that return a not
> > supported error code at run time), we need to make run time tests for what 
> > the
> > cpu supports.  While I'm actually ok with doing that (I think it makes alot 
> > of
> > sense), Bruce and I just spent several days and dozens of emails debating 
> > that,
> > so you can understand why I don't want to write yet another version of this
> > patch that requires doing the exact thing we just argued about, especially 
> > if it
> > means he's going to pipe back up and say no, driving me back to a common
> > single
> > implementation that compiles and runs for all platforms.  I'm not going to 
> > keep
> > re-writing this boucing back and forth between your opposing viewpoints.  We
> > need to agree on a direction before I make another pass at this.
> > 
> > > Whatever way we choose, I think the implementation need to be:
> > > 1) correct
> > Obviously.
> > 
> > > 2) allow easily add(/modify) code path optimised for particular 
> > > architecture.
> > > Without need to modify/re-test what you call  'least common denominator'
> > code path.
> > > And visa-versa, if someone find a way to optimise common code path - no
> > need to
> > > touch/retest architecture specific ones.
> > So I'm fine with this, but it is anathema to what Bruce advocated for when 
> > I did
> > this latest iteration.  Bruce advocated for a single common path that 
> > compiled
> > in all cases.  Bruce, do you want to comment here?  I'd really like to get 
> > this
> > settled before I go try this again.
> 
> In our previous discussion I was primarily concerned with the ixgbe driver, 
> which already had a number of scalar code paths as well as the vector one, so 
> I was very keen there not to see more code paths created. However, while I 
> hate seeing more code paths created that need to be maintained, I am ok with 
> having them created if the benefit is big enough. Up till now code path 
> selection would have been done at compile-time, but you've convinced me that 
> if we have the code paths there, selecting them at runtime makes more sense 
> for a packaged build. 
> For ACL specifically, I generally defer to Konstantin as the expert in this 
> area. If we need separate code paths for scalar, SSE and AVX, and each gives 
> considerable performance improvement over the other, then I'm ok with that.
> 
> /Bruce
> 

I'm still not sure how you thought I was creating new code paths in the ixgbe
driver using run time selection vs. compile time selection, but regardless, if
run time path selection is the consensus, thats good, I can do that.

Neil


Reply via email to