save-buffer commented on PR #13661:
URL: https://github.com/apache/arrow/pull/13661#issuecomment-1191888800

   I've been trying to get caught up on the context here - I took a look at 
#13654. My current understanding is:
   - The problem we are trying to solve is insanely large functions generated 
by the codegen framework when using -O3
   - The theory is that -O3 applies tons of crazy optimizations that lead to 
lots of bloat due to too much vectorized code
   Does that sound right? 
   
   So looking at the results, -O3 adds about 1MB (to ~22MB) to the total binary 
size, so I think that's not an issue itself. However, there is something to be 
said about bloating individual kernels. Reading the other PR, it seems like one 
of the kernels was 40 KB big? That's quite alarming as chips these days have 
about 32 KB of icache. In the worst case, that's quite a bit of thrashing. 
   That particular disassembly looks to me like the compiler is vectorizing 
_and_ unrolling the loop after vectorizing it. 
   
   As for solutions: Looking at the benchmarks, it seems like the code the 
compiler generates for the current implementation is pretty sensitive to the 
flags used. I'm not sure messing with compiler flags will be one-size-fits-all, 
as each combination of flags causes large changes in the generated code. I did 
like the changes in #13654.
   
   I really liked this point, which very much aligns with my experience and 
intuition that abstract templates lead to unstable code generation:
   > our approach (so much for "zero cost abstractions") for generalizing to 
abstract between writing to an array versus packing a bitmap is causing too 
much code to be generated. 
   
   So two solutions we could have are:
   - Keep existing code and compilation flags but explicitly disable the flags 
for problematic kernels (using something like `#pragma GCC push_options` and 
`#pragma GCC pop_options`, though I'm not sure if there's a way to do this on 
MSVC).
   - Change the code to use fewer templates and more raw for loops. If we're 
feeling really adventurous, we could write a Python or Jinja script that 
generates the kernels as the simplest possible for loop (I know this is the 
approach used in a lot of databases). I have never seen a problem with this 
style of code even on -O3. 
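
   To make the two options concrete, here is a rough sketch rather than 
anything from the Arrow codebase. The kernel names `GreaterToBytes` and 
`GreaterToBitmap` are hypothetical, and the bitmap is assumed to be packed 
LSB-first. Each kernel is a plain for loop with no templates, and the whole 
region is wrapped in GCC's `push_options` / `optimize` / `pop_options` pragmas 
so it is compiled at -O2 even when the rest of the file uses -O3. The pragmas 
are guarded so other compilers skip them.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernels for illustration only; these are not Arrow APIs.

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize("O2")  // cap this region at -O2 even under -O3
#endif

// Option 2 style: the simplest possible loop, one output byte per element.
void GreaterToBytes(const int32_t* left, const int32_t* right, std::size_t n,
                    uint8_t* out) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = left[i] > right[i] ? 1 : 0;
  }
}

// Same comparison, but packing one bit per element (LSB-first, an assumption).
// `out` must have room for (n + 7) / 8 bytes.
void GreaterToBitmap(const int32_t* left, const int32_t* right, std::size_t n,
                     uint8_t* out) {
  const std::size_t num_bytes = (n + 7) / 8;
  for (std::size_t i = 0; i < num_bytes; ++i) {
    out[i] = 0;
  }
  for (std::size_t i = 0; i < n; ++i) {
    const unsigned bit = left[i] > right[i] ? 1u : 0u;
    out[i / 8] = static_cast<uint8_t>(out[i / 8] | (bit << (i % 8)));
  }
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options  // restore the command-line optimization options
#endif
```

   The two output forms are written as separate loops here instead of sharing 
one templated writer, which is the trade-off the second option is suggesting. 
Whether this actually produces smaller code for a given kernel would still need 
to be checked in the disassembly.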


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
