I recently adopted the bitslice approach in my own software and noticed
that one of the macros generates a redundant instruction, which affects
the performance of the CUDA implementation.

See: /algorithm/A51/implementation/common/partitioned_bitslice.hpp

Revision 79, Line: 115

BOOST_PP_REPEAT(23, pbs_clock_r3,);

22 repetitions are enough:

BOOST_PP_REPEAT(22, pbs_clock_r3,);

Luckily, correctness is not affected. Because "If the difference
between x and y is less than 0, the result is saturated to 0.", the
extra repetition produces no error. It only generates a redundant
assignment of r3_0 to itself:

r3_0 = r3_0 & not_clock_r3 | r3_0 & do_clock_r3;
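
To illustrate how the saturation leads to this line, here is a minimal
sketch in plain C++. The body of pbs_clock_r3 is not reproduced in this
mail, so the index formula below (destination 22 - n, source 21 - n,
both saturating) is an assumption, and sat_sub, R3Indices and r3_indices
are hypothetical names. The quoted sentence appears to come from the
Boost.Preprocessor documentation of BOOST_PP_SUB, which sat_sub imitates.

```cpp
#include <cassert>

// Saturating subtraction: never goes below 0, like the quoted rule.
constexpr unsigned sat_sub(unsigned x, unsigned y) {
    return x > y ? x - y : 0u;
}

// Hypothetical index pair for repetition n of the macro:
// r3_<dst> keeps its value when not clocked and takes r3_<src> when clocked.
struct R3Indices {
    unsigned dst;
    unsigned src;
};

constexpr R3Indices r3_indices(unsigned n) {
    return R3Indices{sat_sub(22u, n), sat_sub(21u, n)};
}

// Repetitions 0..21 shift r3_21..r3_0 into r3_22..r3_1.
// Repetition 22 (the 23rd) saturates to dst == src == 0.
inline void check_r3_indices() {
    for (unsigned n = 0; n < 22; ++n) {
        assert(r3_indices(n).dst == r3_indices(n).src + 1);
    }
    assert(r3_indices(22).dst == 0 && r3_indices(22).src == 0);
}
```

Under this assumption the first 22 repetitions cover every real shift,
and the 23rd only maps r3_0 onto itself.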

If the compiler recognized that not_clock_r3 and do_clock_r3 are
complements, it might omit this instruction.
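
The reason the assignment is a no-op can be checked with a small C++
sketch. It assumes 32-bit slices and that not_clock_r3 is exactly the
bitwise complement of do_clock_r3; clock_select and
check_self_select_is_noop are hypothetical helper names, not part of the
project.

```cpp
#include <cassert>
#include <cstdint>

// One bitsliced conditional update: each bit lane keeps `stay` where it is
// not clocked and takes `shift_in` where it is clocked.
inline std::uint32_t clock_select(std::uint32_t stay,
                                  std::uint32_t shift_in,
                                  std::uint32_t do_clock) {
    const std::uint32_t not_clock = ~do_clock;  // assumed complement
    return (stay & not_clock) | (shift_in & do_clock);
}

// With stay == shift_in the result is always the input itself, because
// (x & ~m) | (x & m) == x & (~m | m) == x for every mask m.
inline void check_self_select_is_noop() {
    const std::uint32_t samples[] = {0u, 0xFFFFFFFFu, 0xDEADBEEFu, 0x0F0F0F0Fu};
    for (std::uint32_t r3_0 : samples) {
        for (std::uint32_t do_clock : samples) {
            assert(clock_select(r3_0, r3_0, do_clock) == r3_0);
        }
    }
}
```

A compiler can only fold this away if it can prove the complement
relation between the two masks, for example when both are visible in the
same scope as m and ~m.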
_______________________________________________
A51 mailing list
A51@lists.reflextor.com
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
