On Wed, Aug 06, 2014 at 04:59:59PM +0000, Richardson, Bruce wrote: > > -----Original Message----- > > From: Neil Horman [mailto:nhorman at tuxdriver.com] > > Sent: Wednesday, August 06, 2014 5:19 AM > > To: Ananyev, Konstantin > > Cc: dev at dpdk.org; Thomas Monjalon; Richardson, Bruce > > Subject: Re: [PATCH] acl: If build does not support sse4.2, emulate missing > > instructions with C code > > > > On Wed, Aug 06, 2014 at 11:39:22AM +0000, Ananyev, Konstantin wrote: > > > > > > > > > > -----Original Message----- > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com] > > > > Sent: Tuesday, August 05, 2014 7:21 PM > > > > To: Ananyev, Konstantin > > > > Cc: dev at dpdk.org; Thomas Monjalon; Richardson, Bruce > > > > Subject: Re: [PATCH] acl: If build does not support sse4.2, emulate > > > > missing > > instructions with C code > > > > > > > > On Tue, Aug 05, 2014 at 03:26:27PM +0000, Ananyev, Konstantin wrote: > > > > > Hi Neil, > > > > > > > > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com] > > > > > > Sent: Monday, August 04, 2014 4:36 PM > > > > > > To: dev at dpdk.org > > > > > > Cc: Neil Horman; Thomas Monjalon; Ananyev, Konstantin; Richardson, > > Bruce > > > > > > Subject: [PATCH] acl: If build does not support sse4.2, emulate > > > > > > missing > > instructions with C code > > > > > > > > > > > > The ACL library makes extensive use of some SSE4.2 instructions, > > > > > > which > > means the > > > > > > default build can't compile this library. Work around the problem > > > > > > by > > testing > > > > > > the __SSE42__ definition in the acl_vects.h file and defining the > > > > > > macros > > there > > > > > > as intrinsics or c-level equivalants. Note this is a minimal > > > > > > patch, adjusting > > > > > > only the definitions that are currently used in the ACL library. > > > > > > > > > > > > > > > > My comments about actual implementations of c-level equivalents below. > > > > > None of them look correct to me. > > > > > Of course it could be fixed. > > > > > Though I am not sure that is a right way to proceed: > > > > > At first I really doubt that these equivalents will provide similar > > performance. > > > > > As you probably note - we do have a scalar version of > > > > > rte_acl_classify(): > > rte_acl_classify_scalar(). > > > > > So I think it might be faster than vector one with 'emulated' > > > > > instrincts. > > > > > Unfortunately it is all mixed right now into one file and 'scalar' > > > > > version > > could use few sse4 instrincts through resolve_priority(). > > > > > Another thing - we consider to add another version of > > > > > rte_acl_classify() > > that will use avx2 instrincts. > > > > > If we go the way you suggest - I am afraid will soon have to provide > > > > > scalar > > equivalents for several AVX2 instrincts too. > > > > > So in summary that way (providing our own scalar equivalents of SIMD > > instrincts) seems to me slow, hard to maintain and error > > > > prone. > > > > > > > > > > What porbably can be done instead: rework acl_run.c a bit, so it > > > > > provide: > > > > > rte_acl_classify_scalar() - could be build and used on all systems. > > > > > rte_acl_classify_sse() - could be build and used only on systems with > > > > > sse4.2 > > and upper, return ENOTSUP on lower arch. > > > > > In future: rte_acl_classify_avx2 - could be build and used only on > > > > > systems > > with avx2 and upper, return ENOTSUP on lower arch. > > > > > > > > > > I am looking at rte_acl right now anyway. > > > > > So will try to come up with something workable. > > > > > > > > > So, this is exactly the opposite of what Bruce and I just spent several > > > > days > > and > > > > a huge email thread that you clearly are aware of discussing run time > > > > versus > > > > compile time selection of paths. At this point I'm done ping ponging > > between > > > > your opposing viewpoints. If you want to implement something that does > > run time > > > > checking, I'm fine with it, but I'm not going back and forth until you > > > > two > > come > > > > to an agreement on this. > > > > > > Right now, I am not talking about 'run time vs compile time selection'. > > But you are talking about exactly that, allbeit implicitly. To implement > > what > > you recommend above (that being multiple functional paths that return a not > > supported error code at run time), we need to make run time tests for what > > the > > cpu supports. While I'm actually ok with doing that (I think it makes alot > > of > > sense), Bruce and I just spent several days and dozens of emails debating > > that, > > so you can understand why I don't want to write yet another version of this > > patch that requires doing the exact thing we just argued about, especially > > if it > > means he's going to pipe back up and say no, driving me back to a common > > single > > implementation that compiles and runs for all platforms. I'm not going to > > keep > > re-writing this boucing back and forth between your opposing viewpoints. We > > need to agree on a direction before I make another pass at this. > > > > > Whatever way we choose, I think the implementation need to be: > > > 1) correct > > Obviously. > > > > > 2) allow easily add(/modify) code path optimised for particular > > > architecture. > > > Without need to modify/re-test what you call 'least common denominator' > > code path. > > > And visa-versa, if someone find a way to optimise common code path - no > > need to > > > touch/retest architecture specific ones. > > So I'm fine with this, but it is anathema to what Bruce advocated for when > > I did > > this latest iteration. Bruce advocated for a single common path that > > compiled > > in all cases. Bruce, do you want to comment here? I'd really like to get > > this > > settled before I go try this again. > > In our previous discussion I was primarily concerned with the ixgbe driver, > which already had a number of scalar code paths as well as the vector one, so > I was very keen there not to see more code paths created. However, while I > hate seeing more code paths created that need to be maintained, I am ok with > having them created if the benefit is big enough. Up till now code path > selection would have been done at compile-time, but you've convinced me that > if we have the code paths there, selecting them at runtime makes more sense > for a packaged build. > For ACL specifically, I generally defer to Konstantin as the expert in this > area. If we need separate code paths for scalar, SSE and AVX, and each gives > considerable performance improvement over the other, then I'm ok with that. > > /Bruce >
I'm still not sure how you thought I was creating new code paths in the ixgbe driver using run time selection vs. compile time selection, but regardless, if run time path selection is the consensus, thats good, I can do that. Neil