I recently adopted the bitslice approach in my own software and found that one of the macros generates a redundant instruction, which costs some performance in the CUDA implementation.
See /algorithm/A51/implementation/common/partitioned_bitslice.hpp, revision 79, line 115:

    BOOST_PP_REPEAT(23, pbs_clock_r3,);

22 repetitions are enough:

    BOOST_PP_REPEAT(22, pbs_clock_r3,);

Luckily, the extra repetition does not affect correctness. Preprocessor subtraction saturates (the Boost.Preprocessor documentation for BOOST_PP_SUB says: "If the difference between x and y is less than 0, the result is saturated to 0."), so no error is produced. The 23rd repetition only emits an assignment of r3_0 to itself, which serves no purpose:

    r3_0 = r3_0 & not_clock_r3 | r3_0 & do_clock_r3;

If the compiler recognized that not_clock_r3 and do_clock_r3 are complements, it might omit this statement.
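To make the argument concrete, here is a minimal C++ sketch of why the extra statement is a no-op and why 22 conditional moves cover the shift. It is not the code from partitioned_bitslice.hpp: the names slice_t, clock_select and clock_r3_shift are hypothetical, and I assume that not_clock_r3 is exactly the bitwise complement of do_clock_r3 and that the macro moves r3_(i-1) into r3_i.

```cpp
#include <cassert>
#include <cstdint>

// One bitslice word: bit k belongs to the k-th parallel A5/1 instance.
using slice_t = std::uint32_t;

// Conditional move in bitslice form: lanes whose do_clock bit is set take
// `shifted_in`, all other lanes keep `current`.
// Assumption: the "not clock" mask is exactly ~do_clock.
constexpr slice_t clock_select(slice_t current, slice_t shifted_in, slice_t do_clock) {
    return (current & ~do_clock) | (shifted_in & do_clock);
}

// With current == shifted_in the expression collapses to
// x & (~m | m) == x, so the statement generated for r3_0 changes nothing.
static_assert(clock_select(0xDEADBEEFu, 0xDEADBEEFu, 0x0F0F0F0Fu) == 0xDEADBEEFu,
              "selecting between x and x must yield x");

// Hypothetical stand-in for the macro-generated shift of the 23-bit R3:
// 22 conditional moves r3[i] <- r3[i-1] for i = 22..1.
// r3[0] is not touched here; it is left to the feedback step.
inline void clock_r3_shift(slice_t (&r3)[23], slice_t do_clock) {
    for (int i = 22; i >= 1; --i) {
        r3[i] = clock_select(r3[i], r3[i - 1], do_clock);
    }
}

// Small self-check: clock every lane once and confirm the shift.
inline void self_check() {
    slice_t r3[23];
    for (int i = 0; i < 23; ++i) r3[i] = static_cast<slice_t>(i);
    clock_r3_shift(r3, ~slice_t{0});
    assert(r3[22] == 21u);  // took the value of its lower neighbour
    assert(r3[0] == 0u);    // untouched by the 22 moves
}
```

A 23-bit register has only 22 neighbour pairs, so a 23rd repetition has nothing left to shift and can only reproduce the identity above. Whether nvcc removes that statement on its own would have to be checked in the generated PTX; I have not verified it.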