Bug#1064811: libhipsparse0-tests: HIPSPARSE_STATUS_INTERNAL_ERROR

2024-02-26 Thread Cordell Bloor

Hi Christian,

On 2024-02-26 00:54, Christian Kastner wrote:

> On gfx1031/gfx1032/gfx1034, there are numerous occurrences of
> HIPSPARSE_STATUS_INTERNAL_ERROR, see [1] for a full log. Interestingly,
> only some of them lead to test failures (some examples below), and
> sometimes there is more than one occurrence per test.
>
> These passed on gfx900/gfx1030, so I don't immediately suspect my
> updates to the optional-test-matrices and
> allow-missing-matrix-data-in-tests patches to be the cause.


If it works on gfx1030 but fails on gfx1031, gfx1032 and gfx1034, it is
almost certainly caused by run-time dispatching in rocPRIM [2]. I would
ignore this issue until rocPRIM passes its tests on those architectures.


We only build for gfx1030. When running on gfx103{1,2,4}, rocPRIM
checks the current GPU architecture at run time and dispatches to
something other than the gfx1030 implementation. If you'd like to
verify that this is the problem, you can run the tests with the
environment variable HSA_OVERRIDE_GFX_VERSION=10.3.0 on gfx1031,
gfx1032 or gfx1034 hardware.
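
To make the failure mode concrete, here is a minimal sketch of
name-based run-time dispatch. It is a simplified model and not rocPRIM's
actual code (that is at [2]): the Target enum and target_from_name() are
hypothetical names, and the sketch assumes the library contains code for
gfx1030 only.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the architectures the library was built for.
// Assumption for this sketch: only gfx1030 code exists.
enum class Target { gfx1030, unknown };

// Map the architecture name reported by the runtime to a build target.
// Anything after the first ':' (feature suffixes) is ignored.
inline Target target_from_name(const std::string& arch_name)
{
    const std::string base = arch_name.substr(0, arch_name.find(':'));
    if (base == "gfx1030")
        return Target::gfx1030;
    return Target::unknown; // gfx1031, gfx1032 and gfx1034 end up here
}

// Small self-check of the behaviour described in the text.
inline void self_check()
{
    assert(target_from_name("gfx1030") == Target::gfx1030);
    assert(target_from_name("gfx1031") == Target::unknown);
}
```

In this model, a device that reports gfx1031 never reaches the gfx1030
branch. With the override set, the runtime reports gfx1030 instead, so
the same lookup selects the gfx1030 implementation.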


The solution is tricky, though, because rocPRIM is a header-only 
library. We can't use a solution like in rocBLAS [3] where we force 
gfx1031 to use gfx1030 code objects, because librocprim-dev users might 
be building their code for gfx1031... unless maybe we hide that dispatch 
behaviour behind an #ifdef that rocsparse can define.
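
A rough sketch of what such an #ifdef could look like, under loud
assumptions: EXAMPLE_TREAT_GFX103X_AS_GFX1030, the Target enum and
select_target() are invented for illustration and do not exist in
rocPRIM. The macro is defined at the top only to show the opt-in path; a
consumer would more likely pass it with -D at build time.

```cpp
#include <cassert>
#include <string>

// Hypothetical opt-in macro. A consumer that ships only gfx1030 code
// objects would define it; other header users would leave it undefined.
#define EXAMPLE_TREAT_GFX103X_AS_GFX1030 1

// Hypothetical set of dispatch targets known to the header.
enum class Target { gfx1030, gfx1031, gfx1032, gfx1034, unknown };

inline Target select_target(const std::string& arch_name)
{
#ifdef EXAMPLE_TREAT_GFX103X_AS_GFX1030
    // Opt-in path: fold the gfx103x family onto the gfx1030 implementation.
    if (arch_name == "gfx1031" || arch_name == "gfx1032" || arch_name == "gfx1034")
        return Target::gfx1030;
#endif
    // Default path: every architecture keeps its own target, so users who
    // build their own code for gfx1031 are unaffected.
    if (arch_name == "gfx1030") return Target::gfx1030;
    if (arch_name == "gfx1031") return Target::gfx1031;
    if (arch_name == "gfx1032") return Target::gfx1032;
    if (arch_name == "gfx1034") return Target::gfx1034;
    return Target::unknown;
}
```

The point of the guard is that the default behaviour of the header stays
unchanged, and only a library that knows it was built for gfx1030 alone
opts in to the remapping.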


Sincerely,
Cory Bloor

[2]: https://salsa.debian.org/rocm-team/rocprim/-/blob/debian/5.7.1-1/rocprim/include/rocprim/device/config_types.hpp?ref_type=tags#L247
[3]: https://salsa.debian.org/rocm-team/rocblas/-/blob/debian/5.5.1+dfsg-4/debian/patches/0012-expand-isa-compatibility.patch?ref_type=tags


Bug#1064811: libhipsparse0-tests: HIPSPARSE_STATUS_INTERNAL_ERROR

2024-02-26 Thread Christian Kastner
Package: libhipsparse0-tests
Version: 5.7.1-1~exp1
Severity: normal

On gfx1031/gfx1032/gfx1034, there are numerous occurrences of
HIPSPARSE_STATUS_INTERNAL_ERROR, see [1] for a full log. Interestingly,
only some of them lead to test failures (some examples below), and
sometimes there is more than one occurrence per test.

These passed on gfx900/gfx1030, so I don't immediately suspect my
updates to the optional-test-matrices and
allow-missing-matrix-data-in-tests patches to be the cause.

>  63s hipSPARSE error: HIPSPARSE_STATUS_INTERNAL_ERROR
>  63s ./clients/tests/test_dense_to_sparse_csr.cpp:38: Failure
>  63s Expected equality of these values:
>  63s   status
>  63s Which is: 7
>  63s   HIPSPARSE_STATUS_SUCCESS
>  63s Which is: 0
>  63s 
>  63s [  FAILED  ] dense_to_sparse_csr.dense_to_sparse_csr_i32_i32_float (7 ms)

>  63s [ RUN      ] dense_to_sparse_csc.dense_to_sparse_csc_i64_i32_double
>  63s hipSPARSE error: HIPSPARSE_STATUS_INTERNAL_ERROR
>  63s hipSPARSE error: HIPSPARSE_STATUS_INTERNAL_ERROR
>  63s [       OK ] dense_to_sparse_csc.dense_to_sparse_csc_i64_i32_double (3 ms)

>  66s [ RUN      ] bsrmv/parameterized_bsrmv.bsrmv_float/260
>  66s hipSPARSE error: HIPSPARSE_STATUS_INTERNAL_ERROR
>  66s hipSPARSE error: HIPSPARSE_STATUS_INTERNAL_ERROR
>  66s hipSPARSE error: HIPSPARSE_STATUS_INTERNAL_ERROR
>  66s hipSPARSE error: HIPSPARSE_STATUS_INTERNAL_ERROR
>  66s ./clients/tests/test_bsrmv.cpp:114: Failure
>  66s Expected equality of these values:
>  66s   status
>  66s Which is: 7
>  66s   HIPSPARSE_STATUS_SUCCESS
>  66s Which is: 0
>  66s 
>  66s [  FAILED  ] bsrmv/parameterized_bsrmv.bsrmv_float/260, where GetParam() = (500, 842, 3, 1, 9, 0, 0) (1 ms)

[1]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1031/h/hipsparse/7762/log.gz