louislepage commented on PR #1849: URL: https://github.com/apache/systemds/pull/1849#issuecomment-1615848526
## Experiment: To analyze and compare the performance of DenseBlockBool as Bitset and as BooleanArray to the current FP64 implementation, a java program was created that uses the classes from systemds ([systemds-boolean-matrix-evaluation](https://github.com/louislepage/systemds-boolean-matrix-evaluation)). For each given matrix size, it does five computations with pseudo-randomly generated matrices for double to boolean operators and for boolean to boolean operators. Both the total runtime of each operation and the memory used are stored. Finally, the average over the five runs is taken. The Matrix sizes are chosen to represent up to 16mio total entries, but with different dims. We used square matrices of up to 4000x4000 and non-square ones of up to 100x160000. In the non-square case, one dimension was always fixed to a length of 100. ## Evaluation of Double to Bool (GreaterThan) As input, two dense FP64 Matrices are randomly generated based on a given seed. As the output, a MatrixBlock with a DenseBlock of the desired type (FP64, Bitset, or BooleanArray) is created. The timer is started, the operator called with the two input and the output matrices, the timer is stopped, and the result matrix is returned for validation. Memory is computed as the difference between the memory used before the in- and output matrices are created and after the result is computed (but before any Matrix blocks were dereferenced).  The plot shows the runtimes on the left and memory usage on the right. In blue, the results of the bitset implementation are plotted. In green is the current Double/FP64 implementation, and in orange is the Boolean Array implementation. The different markers represent the results of square (square marker), Nx100 (circle), and 100xN (cross) matrices. As expected, both runtime and memory usage increase linearly with an increased number of entries in the matrix. The results show a visible performance increase for computation regarding runtime when using BooleanArray. This is most likely due to faster, direct access to the smaller arrays and fewer translations between boolean results and their double representation to store them in the result matrix since the operators can return boolean values, which can then be directly stored in the result matrices. Interestingly the boolean array outperforms the Bitset here. The Bitset was even slower than the FP64 implementation. Without further investigation, this is probably due to fewer instructions necessary to access the boolean array compared to more when storing a value in a bitset. Also, it is worth noting that the bitset stores the values internally as `long` and therefore has to access an 8bytes large long whenever accessing a value. This might influence caching. Regarding memory, both boolean implementations took less space. As expected, the Bitset takes even less space since each boolean in the boolean arrays still takes up one byte, but the bitset stores the booleans as individual bits that are collected into `long`. There is no difference in runtime or memory when using non-square matrices as long as the total number of entires is the same. ## Evaluation of Boolen To Boolean (Xor) As input, two dense Matrices of the given type (FP64, Bitset, or BooleanArray) with boolean values are randomly generated based on a given seed. As the output, a MatrixBlock with a DenseBlock of the desired type is created. The timer is started, and the operator is called with the two input and the output matrices in a loop for 20 iterations. The timer is stopped, and the result matrix is returned for validation. Memory is computed as the difference between the memory used before the in- and output matrices are created and after the loop ended (but before any Matrix blocks were dereferenced).  The plot shows the runtimes on the left and memory usage on the right. In blue, the results of the bitset implementation are plotted. In green is the current Double/FP64 implementation, and in orange is the Boolean Array implementation. The different markers represent the results of square (square marker), Nx100 (circle), and 100xN (cross) matrices. As expected, an improved runtime for the boolean arrays is also visible in this case and both runtime and memory usage increase linearly with an increased number of entries in the matrix. We can also see an improvement in memory usage when using a boolean implementation. The memory used is less than when running the Double to Boolean operator, even when using FP64, but this is probably due to reusing the result matrix as input to the next computation. Therefore after the first iteration, there are only two matrices in memory. Memory usage for the Boolean to Boolean case, when using actual Booleans as input, has decreased significantly since all matrices in memory are small, not only the output matrix. It is also visible that the bitset again outperforms the boolean array regarding memory usage. There is no difference in runtime or memory when using non-square matrices as long as the total number of entries is the same. ## Conclusion: Since both boolean implementations improve memory usage, but only the boolean array outperforms the FP64 implementation in runtime, I suggest using the boolean array as the DenseBlock for any operation that results in only boolean data. Let me know if there are any further open questions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
