Hello again. So I added 64-bit support to theano/gpuarray/c_code/magma_cholesky.c and to theano/gpuarray/linalg.py in the function GpuMagmaCholesky(). I attached the files. It now works for 32-bit and 64-bit and has a gradient. The numerical problem is gone. But (and this is a big BUT) it seems to be at least a factor of 2 slower than the CPU. Any thoughts on this?
Paul
On Thursday, February 6, 2020 at 10:28:08 AM UTC+1, Paul Baggenstoss wrote:
> Simon,
> I did more digging and have some more information. I tested
> theano.gpuarray.linalg.GpuMagmaCholesky() on float32 and it looks good.
> The result is exactly the same as for CPU.
> So the problem seems to be in cuSOLVER. The problem is that
> theano.gpuarray.linalg.GpuMagmaCholesky()(Cll) does not define a gradient
> and works only for float32. I installed the latest magma-2.5.2 and it has
> support for double-precision Cholesky (dpotrf), but Theano seems to use
> its own copy of the MAGMA source. Not sure how that works. Can I force
> Theano to use magma-2.5.2? If not, it seems feasible to borrow the
> gradient from theano.gpuarray.linalg.GpuCholesky() and add support for
> float64 as well. Thoughts?
> Paul
>
> On Wednesday, February 5, 2020 at 5:32:43 PM UTC+1, Paul Baggenstoss wrote:
>> Hi Simon, I forgot to mention that I use the gradient of Cholesky, and
>> this has even more error than the Cholesky decomposition itself, but I
>> assume that this is because of a bug in Cholesky itself.
>> Paul
>>
>> On Wednesday, February 5, 2020 at 5:30:10 PM UTC+1, Paul Baggenstoss wrote:
>>> Hi Simon, I have uploaded the MATLAB-format file with the matrix Cll,
>>> which is the original matrix, R_cpu, which was produced on the CPU by
>>> slinalg.Cholesky(), and R_cuda, which was produced by the same function
>>> but on the GPU (I think it uses theano.gpuarray.linalg.GpuCholesky()).
>>> Both used the same precision (float32), so they should give the same
>>> results. But you can see that at the end of the diagonal the values go
>>> wild. It appears to be numerical errors.
>>> Thanks in advance!
>>> Paul
>>>
>>> On Wednesday, February 5, 2020 at 4:56:14 PM UTC+1, Wong Hang wrote:
>>>> Hi,
>>>>
>>>> The GPU Cholesky decomposition relies on cuSOLVER or MAGMA. I believe
>>>> NVIDIA knows their hardware well, and cuSOLVER should give the most
>>>> efficient result.
>>>>
>>>> Although Cholesky decomposition is numerically very stable, when I
>>>> wrote the test case I found that I ran into trouble even for
>>>> relatively small matrices if I used single precision.
>>>>
>>>> Are you using single precision on a big matrix? If not, try computing
>>>> the condition number of the matrix to see if it is too big.
>>>>
>>>> If it is not too big, then it may be a bug. I also need to use the
>>>> Cholesky operator, so please send me the matrix and I am willing to
>>>> fix it.
>>>>
>>>> Best,
>>>>
>>>> On Thu, Feb 6, 2020 at 0:34, Paul Baggenstoss <p.m.ba...@ieee.org> wrote:
>>>>> Hi Simon, I was wondering if you got anywhere with the faster
>>>>> Cholesky for Theano. I also use it a lot and have found it to be
>>>>> unstable on the GPU.
>>>>> Paul
>>>>>
>>>>> On Saturday, March 7, 2015 at 11:45:36 AM UTC+1, Simon Ebner wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I want to do computations where I rely heavily on the Cholesky
>>>>>> decomposition. Writing a small benchmark for
>>>>>> tensor.slinalg.Cholesky, I noticed that the implementation is not
>>>>>> as fast as I hoped. As far as I can tell it is not optimized for
>>>>>> GPUs yet but relies on the scipy implementation?
>>>>>> Doing a bit of a Google search, I found several CUDA
>>>>>> implementations of fast Cholesky decompositions on the GPU. Before
>>>>>> I try to include that code into my Theano environment, could you
>>>>>> let me know whether you decided not to implement a fast Cholesky
>>>>>> decomposition on the GPU on purpose? Furthermore, since I'm fairly
>>>>>> new to Theano, I'm not completely confident how best to
>>>>>> incorporate CUDA code into my existing Theano code. Is it sensible
>>>>>> to create a custom Op with optimized C code?
>>>>>>
>>>>>> Best,
>>>>>> Simon
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "theano-users" group.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/theano-users/aca41c35-ec36-4055-bac7-e53aced30ea7%40googlegroups.com
# === attached file: theano/gpuarray/linalg.py ===

from __future__ import absolute_import, division, print_function

import warnings

import pkg_resources
import numpy as np
from numpy.linalg.linalg import LinAlgError

import theano
from theano import Op, config, tensor
from theano.scalar import bool as bool_t
from theano.gof import COp, ParamsType
from theano.gpuarray import GpuArrayType

from .basic_ops import (CGpuKernelBase, as_gpuarray_variable, gpu_contiguous,
                        gpuarray_helper_inc_dir, infer_context_name)
from .type import gpu_context_type

try:
    import pygpu
    from pygpu.basic import triu, tril
    pygpu_available = True
except ImportError:
    pygpu_available = False

cusolver_available = False
try:
    import skcuda
    from skcuda import cusolver
    cusolver_available = True
except (ImportError, OSError, RuntimeError,
        pkg_resources.DistributionNotFound):
    pass

cublas_available = False
try:
    from skcuda import cublas
    cublas_available = True
except (ImportError, OSError, RuntimeError,
        pkg_resources.DistributionNotFound):
    pass

if cusolver_available:
    # Add cusolver call as it is missing in skcuda
    # SPOTRS
    cusolver._libcusolver.cusolverDnSpotrs.restype = int
    cusolver._libcusolver.cusolverDnSpotrs.argtypes = [cusolver.ctypes.c_void_p,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_void_p,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_void_p,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_void_p]

    def cusolverDnSpotrs(handle, uplo, n, nrhs, A, lda,
                         B, ldb, devInfo):
        """
        Solve real single precision linear system for hermitian matrices.

        References
        ----------
        `cusolverDn<t>potrs <http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-potrs>`_
        """
        status = cusolver._libcusolver.cusolverDnSpotrs(handle, uplo, n, nrhs,
                                                        int(A), lda, int(B),
                                                        ldb, int(devInfo))
        cusolver.cusolverCheckStatus(status)

    # DPOTRS
    # TODO: Are they still missing in skcuda?
    cusolver._libcusolver.cusolverDnDpotrs.restype = int
    cusolver._libcusolver.cusolverDnDpotrs.argtypes = [cusolver.ctypes.c_void_p,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_void_p,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_void_p,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_void_p]

    def cusolverDnDpotrs(handle, uplo, n, nrhs, A, lda,
                         B, ldb, devInfo):
        status = cusolver._libcusolver.cusolverDnDpotrs(handle, uplo, n, nrhs,
                                                        int(A), lda, int(B),
                                                        ldb, int(devInfo))
        cusolver.cusolverCheckStatus(status)


def attach_cusolver_handle_to_context(ctx):
    handle = getattr(ctx, 'cusolver_handle', None)
    if handle is None:
        with ctx:
            ctx.cusolver_handle = cusolver.cusolverDnCreate()


def attach_cublas_handle_to_context(ctx):
    handle = getattr(ctx, 'cublas_handle', None)
    if handle is None:
        with ctx:
            ctx.cublas_handle = cublas.cublasCreate()


# it is a subset of all cases available in slinalg's MATRIX_STRUCTURE
MATRIX_STRUCTURES_SOLVE = (
    'general',
    'symmetric',
    'lower_triangular',
    'upper_triangular')


class GpuCusolverSolve(Op):
    """
    CUSOLVER GPU solver OP.

    Parameters
    ----------
    trans
        Whether to take the transpose of the input matrix or not.
    """

    __props__ = ('A_structure', 'trans', 'inplace')

    def __init__(self, A_structure='general', trans='N', inplace=False):
        self.trans = trans
        self.inplace = inplace
        self.A_structure = A_structure
        if self.inplace:
            self.destroy_map = {0: [0]}
        assert A_structure in MATRIX_STRUCTURES_SOLVE
        super(GpuCusolverSolve, self).__init__()

    def make_node(self, inp1, inp2):
        if not cusolver_available:
            raise RuntimeError('CUSOLVER is not available and '
                               'GpuCusolverSolve Op can not be constructed.')
        if skcuda.__version__ <= '0.5.1':
            warnings.warn('The GpuSolve op requires scikit-cuda > 0.5.1 '
                          'to work with CUDA 8')
        context_name = infer_context_name(inp1, inp2)
        inp1 = as_gpuarray_variable(inp1, context_name)
        inp2 = as_gpuarray_variable(inp2, context_name)

        inp1 = gpu_contiguous(inp1)
        inp2 = gpu_contiguous(inp2)

        assert inp1.ndim == 2
        assert inp2.ndim == 2
        assert inp1.dtype == inp2.dtype

        return theano.Apply(
            self, [inp1, inp2],
            [GpuArrayType(inp1.dtype,
                          broadcastable=inp1.broadcastable,
                          context_name=context_name)()])

    def prepare_node(self, node, storage_map, compute_map, impl):
        ctx = node.inputs[0].type.context
        attach_cusolver_handle_to_context(ctx)

    def check_dev_info(self, dev_info):
        val = np.asarray(dev_info)[0]
        if val > 0:
            raise LinAlgError('A is singular')

    def perform(self, node, inputs, outputs):
        context = inputs[0][0].context

        # Size of the matrices to invert.
        z = outputs[0]

        # Matrix.
        A = inputs[0]

        # Solution vectors.
        b = inputs[1]

        assert(len(A.shape) == 2)
        assert(len(b.shape) == 2)

        if self.trans in ['T', 'C']:
            trans = 1
            l, n = A.shape
            k, m = b.shape
        elif self.trans == 'N':
            trans = 0
            n, l = A.shape
            k, m = b.shape
        else:
            raise ValueError('Invalid value for trans')
        if l != n:
            raise ValueError('A must be a square matrix')
        if n != k:
            raise ValueError('A and b must be aligned.')

        lda = max(1, n)
        ldb = max(1, k)

        # We copy A and b as cusolver operates inplace
        b = pygpu.array(b, copy=True, order='F')
        if not self.inplace:
            A = pygpu.array(A, copy=True)
        A_ptr = A.gpudata
        b_ptr = b.gpudata

        # cusolver expects a F ordered matrix, but A is not explicitly
        # converted between C and F order, instead we switch the
        # "transpose" flag.
        if A.flags['C_CONTIGUOUS']:
            trans = 1 - trans

        if A.dtype == 'float32':
            potrf_bufferSize = cusolver.cusolverDnSpotrf_bufferSize
            potrf = cusolver.cusolverDnSpotrf
            potrs = cusolverDnSpotrs
            getrf_bufferSize = cusolver.cusolverDnSgetrf_bufferSize
            getrf = cusolver.cusolverDnSgetrf
            getrs = cusolver.cusolverDnSgetrs
        elif A.dtype == 'float64':
            potrf_bufferSize = cusolver.cusolverDnDpotrf_bufferSize
            potrf = cusolver.cusolverDnDpotrf
            potrs = cusolverDnDpotrs
            getrf_bufferSize = cusolver.cusolverDnDgetrf_bufferSize
            getrf = cusolver.cusolverDnDgetrf
            getrs = cusolver.cusolverDnDgetrs
        else:
            raise ValueError("Unsupported dtype")

        if self.A_structure == 'symmetric':
            with context:
                workspace_size = potrf_bufferSize(
                    context.cusolver_handle, 0, n, A_ptr, lda)

            workspace = pygpu.zeros(workspace_size, dtype=A.dtype,
                                    context=context)

            dev_info = pygpu.zeros((1,), dtype='int32', context=context)

            workspace_ptr = workspace.gpudata
            dev_info_ptr = dev_info.gpudata

            with context:
                potrf(
                    context.cusolver_handle, 0, n, A_ptr, lda, workspace_ptr,
                    workspace_size, dev_info_ptr)
                self.check_dev_info(dev_info)

                potrs(
                    context.cusolver_handle, 0, n, m, A_ptr, lda,
                    b_ptr, ldb, dev_info_ptr)

        else:
            # general case for A
            with context:
                workspace_size = getrf_bufferSize(
                    context.cusolver_handle, n, n, A_ptr, lda)
            workspace = pygpu.zeros(workspace_size, dtype=A.dtype,
                                    context=context)

            pivots = pygpu.zeros(n, dtype='int32', context=context)

            dev_info = pygpu.zeros((1,), dtype='int32', context=context)

            workspace_ptr = workspace.gpudata
            pivots_ptr = pivots.gpudata
            dev_info_ptr = dev_info.gpudata

            with context:
                getrf(
                    context.cusolver_handle, n, n, A_ptr, lda, workspace_ptr,
                    pivots_ptr, dev_info_ptr)
                self.check_dev_info(dev_info)

                getrs(
                    context.cusolver_handle, trans, n, m, A_ptr, lda,
                    pivots_ptr, b_ptr, ldb, dev_info_ptr)

        z[0] = b

    def L_op(self, inputs, outputs, output_gradients):
        # Modified from theano/tensor/slinalg.py
        A, b = inputs
        c = outputs[0]
        c_bar = output_gradients[0]
        # FIXME: triangular structure would use GpuCublasTriangularsolve?
        # no need to handle A_structure like slinalg.py?
        trans_solve_op = GpuCusolverSolve('general')
        b_bar = trans_solve_op(A.T, c_bar)
        A_bar = -tensor.outer(b_bar, c) if c.ndim == 1 else -b_bar.dot(c.T)
        return [A_bar, b_bar]


class GpuCublasTriangularSolve(Op):
    """
    CUBLAS GPU Triangular Solve Op.

    Parameters
    ----------
    lower
        Whether system is lower-triangular (True) or upper-triangular (False).
    trans
        Whether to take the transpose of the input matrix or not.
    """
    __props__ = ('trans', 'lower')

    def __init__(self, lower=True, trans='N'):
        self.trans = trans
        self.lower = lower
        super(GpuCublasTriangularSolve, self).__init__()

    def make_node(self, inp1, inp2):
        if not cublas_available:
            raise RuntimeError('CUBLAS is not available and '
                               'GpuCublasTriangularSolve Op '
                               'can not be constructed.')
        context_name = infer_context_name(inp1, inp2)
        inp1 = as_gpuarray_variable(inp1, context_name)
        inp2 = as_gpuarray_variable(inp2, context_name)

        inp1 = gpu_contiguous(inp1)
        inp2 = gpu_contiguous(inp2)

        assert inp1.ndim == 2
        assert inp2.ndim in [1, 2]
        assert inp1.dtype == inp2.dtype

        return theano.Apply(self, [inp1, inp2],
                            [GpuArrayType(inp1.dtype,
                                          broadcastable=inp2.broadcastable,
                                          context_name=context_name)()])

    def prepare_node(self, node, storage_map, compute_map, impl):
        ctx = node.inputs[0].type.context
        attach_cublas_handle_to_context(ctx)

    def perform(self, node, inputs, outputs):
        ctx = node.inputs[0].type.context

        # Solution set
        x = outputs[0]

        # Matrix.
        A = inputs[0]

        # right hand side
        b = inputs[1]

        assert(len(A.shape) == 2)
        assert(len(b.shape) in [1, 2])

        # implicitly deal with the difference between C order
        # and fortran order by flipping the trans and lower flags
        lower = not self.lower
        trans = self.trans
        if trans in ['T', 'C']:
            trans = 'N'
            l, n = A.shape
        elif trans == 'N':
            trans = 'T'
            n, l = A.shape
        else:
            raise ValueError('Invalid value for trans')

        if b.ndim == 2:
            k, m = b.shape
        else:
            k, = b.shape
            m = 1

        if l != n:
            raise ValueError('A must be a square matrix')
        if n != k:
            raise ValueError('A and b must be aligned.')

        lda = max(1, n)
        ldb = max(1, k)

        # solution overwrites right hand side on exit
        b = pygpu.array(b, copy=True, order='F')

        A_ptr = A.gpudata
        b_ptr = b.gpudata

        # unit scalar used for multiplication
        alpha = 1.0
        # indicates matrix A is on left of B
        side = 'l'
        # set whether upper or lower part of matrix A stored
        uplo = 'l' if lower else 'u'
        # indicates elements on diagonal of matrix A may not be unity
        diag = 'n'

        if A.dtype == 'float32':
            trsv = cublas.cublasStrsv
            trsm = cublas.cublasStrsm
        elif A.dtype == 'float64':
            trsv = cublas.cublasDtrsv
            trsm = cublas.cublasDtrsm
        else:
            raise ValueError("Unsupported dtype")

        with ctx:
            if b.ndim == 1:
                # matrix vector solve
                trsv(ctx.cublas_handle, uplo, trans, diag, n,
                     A_ptr, lda, b_ptr, 1)
            else:
                trsm(ctx.cublas_handle, side, uplo, trans, diag,
                     n, m, alpha, A_ptr, lda, b_ptr, ldb)

        x[0] = b

    def L_op(self, inputs, outputs, output_gradients):
        # Modified from theano/tensor/slinalg.py
        A, b = inputs
        c = outputs[0]
        c_bar = output_gradients[0]

        trans_solve_op = GpuCublasTriangularSolve(not self.lower)
        b_bar = trans_solve_op(A.T, c_bar)

        A_bar = -tensor.outer(b_bar, c) if c.ndim == 1 else -b_bar.dot(c.T)

        if self.lower:
            A_bar = tensor.tril(A_bar)
        else:
            A_bar = tensor.triu(A_bar)
        return [A_bar, b_bar]


def gpu_solve(A, b, A_structure='general', trans='N'):
    if A_structure == 'lower':
        return GpuCublasTriangularSolve(True, trans)(A, b)
    elif A_structure == 'upper':
        return GpuCublasTriangularSolve(False, trans)(A, b)

    return GpuCusolverSolve(A_structure, trans)(A, b)


def gpu_solve_lower_triangular(A, b, trans='N'):
    return GpuCublasTriangularSolve(True, trans)(A, b)


def gpu_solve_upper_triangular(A, b, trans='N'):
    return GpuCublasTriangularSolve(False, trans)(A, b)


class GpuCholesky(Op):
    """
    CUSOLVER GPU Cholesky Op.

    Given a real positive definite matrix `A` returns either a lower
    triangular matrix `L` such that `A == dot(L, L.T)` if `lower == True`
    else returns an upper triangular matrix `U` such that `A == dot(U.T, U)`
    if `lower == False`.

    Parameters
    ----------
    lower
        Whether to return a lower rather than upper triangular decomposition.
    """

    __props__ = ('lower', 'inplace')

    def __init__(self, lower=True, inplace=False):
        self.lower = lower
        self.inplace = inplace
        if self.inplace:
            self.destroy_map = {0: [0]}
        super(GpuCholesky, self).__init__()

    def clone_inplace(self):
        return self.__class__(lower=self.lower, inplace=True)

    def make_node(self, inp):
        if not cusolver_available:
            raise RuntimeError('CUSOLVER is not available and '
                               'GpuCholesky Op can not be constructed.')
        if skcuda.__version__ <= '0.5.1':
            warnings.warn('The GpuCholesky op requires scikit-cuda > '
                          '0.5.1 to work with CUDA 8')
        if not pygpu_available:
            raise RuntimeError('Missing pygpu or triu/tril functions.'
                               'Install or update libgpuarray.')
        context_name = infer_context_name(inp)

        inp = as_gpuarray_variable(inp, context_name)

        inp = gpu_contiguous(inp)

        assert inp.ndim == 2

        return theano.Apply(self, [inp], [inp.type()])

    def prepare_node(self, node, storage_map, compute_map, impl):
        ctx = node.inputs[0].type.context
        attach_cusolver_handle_to_context(ctx)

    def perform(self, node, inputs, outputs):
        context = inputs[0][0].context

        # Input matrix.
        A = inputs[0]

        l, n = A.shape
        if l != n:
            raise ValueError('A must be a square matrix')

        lda = max(1, n)

        # cusolver operates on F ordered matrices, but A is expected
        # to be symmetric so it does not matter.
        # We copy A if needed
        if self.inplace:
            L = A
        else:
            L = pygpu.array(A, copy=True)

        # The output matrix will contain only the upper or lower
        # triangular factorization of A.
        # If L is C ordered (it
        # probably is as it is the default in Theano) we just switch
        # the fill mode parameter of cusolver
        l_parameter = 0 if self.lower else 1
        if L.flags['C_CONTIGUOUS']:
            l_parameter = 1 - l_parameter

        L_ptr = L.gpudata

        if A.dtype == 'float32':
            potrf_bufferSize = cusolver.cusolverDnSpotrf_bufferSize
            potrf = cusolver.cusolverDnSpotrf
        elif A.dtype == 'float64':
            potrf_bufferSize = cusolver.cusolverDnDpotrf_bufferSize
            potrf = cusolver.cusolverDnDpotrf
        else:
            raise ValueError("Unsupported dtype")

        with context:
            workspace_size = potrf_bufferSize(
                context.cusolver_handle, l_parameter, n, L_ptr, lda)

            workspace = pygpu.zeros(workspace_size, dtype=A.dtype,
                                    context=context)

            dev_info = pygpu.zeros((1,), dtype='int32', context=context)

            workspace_ptr = workspace.gpudata
            dev_info_ptr = dev_info.gpudata

            potrf(context.cusolver_handle, l_parameter, n, L_ptr,
                  lda, workspace_ptr, workspace_size, dev_info_ptr)

            val_dev_info = np.asarray(dev_info)[0]
            if val_dev_info > 0:
                raise LinAlgError('Cholesky decomposition failed (is A SPD?)')

        # cusolver leaves the elements in the matrix outside the considered
        # upper or lower triangle unchanged, so we need to put zeros outside
        # the triangle
        if self.lower:
            tril(L)
        else:
            triu(L)

        outputs[0][0] = L

    def L_op(self, inputs, outputs, gradients):
        # Modified from theano/tensor/slinalg.py
        # No handling for on_error = 'nan'
        dz = gradients[0]
        chol_x = outputs[0]

        # this is for nan mode
        #
        # ok = ~tensor.any(tensor.isnan(chol_x))
        # chol_x = tensor.switch(ok, chol_x, 1)
        # dz = tensor.switch(ok, dz, 1)

        # deal with upper triangular by converting to lower triangular
        if not self.lower:
            chol_x = chol_x.T
            dz = dz.T

        def tril_and_halve_diagonal(mtx):
            """Extracts lower triangle of square matrix and halves diagonal."""
            return tensor.tril(mtx) - tensor.diag(tensor.diagonal(mtx) / 2.)
        def conjugate_solve_triangular(outer, inner):
            """Computes L^{-T} P L^{-1} for lower-triangular L."""
            return gpu_solve_upper_triangular(
                outer.T, gpu_solve_upper_triangular(outer.T, inner.T).T)

        s = conjugate_solve_triangular(
            chol_x, tril_and_halve_diagonal(chol_x.T.dot(dz)))

        if self.lower:
            grad = tensor.tril(s + s.T) - tensor.diag(tensor.diagonal(s))
        else:
            grad = tensor.triu(s + s.T) - tensor.diag(tensor.diagonal(s))
        return [grad]


def gpu_cholesky(A, lower=True):
    return GpuCholesky(lower)(A)


# TODO: add support for float64
class GpuMagmaBase(COp):
    """Base class for magma related operations. Add the necessary headers,
    libraries and optionally the location of headers and library.
    """
    def c_headers(self):
        return ['gpuarray/types.h', 'gpuarray/array.h', 'gpuarray/ext_cuda.h',
                'gpuarray_helper.h', 'magma.h']

    def c_header_dirs(self):
        dirs = [gpuarray_helper_inc_dir(), pygpu.get_include(),
                config.cuda.include_path]
        if config.magma.include_path:
            dirs.append(config.magma.include_path)
        return dirs

    def c_libraries(self):
        return ['magma']

    def c_lib_dirs(self):
        if config.magma.library_path:
            return [config.magma.library_path]
        return []

    def prepare_node(self, node, storage_map, compute_map, impl):
        from skcuda.magma import magma_init
        ctx = node.inputs[0].type.context
        if not getattr(ctx, 'is_magma_initialized', False):
            with ctx:
                magma_init()
                ctx.is_magma_initialized = True


class GpuMagmaSVD(GpuMagmaBase):
    """Computes the svd of a matrix :math:`A` using magma library.

    .. warning::

        Because of implementation constraints, this Op returns outputs
        in order ``S, U, VT``. Use :func:`theano.gpuarray.linalg.gpu_svd`
        to get them in expected order ``U, S, VT``.
    """
    __props__ = ('full_matrices', 'compute_uv')
    _cop_num_inputs = 1
    _cop_num_outputs = 3
    check_input = False
    params_type = ParamsType(full_matrices=bool_t, context=gpu_context_type)

    def __init__(self, full_matrices=True, compute_uv=True):
        self.full_matrices = full_matrices
        self.compute_uv = compute_uv
        COp.__init__(self, ['c_code/magma_svd.c'], 'APPLY_SPECIFIC(magma_svd)')

    def make_node(self, A):
        ctx_name = infer_context_name(A)
        A = as_gpuarray_variable(A, ctx_name)
        A = gpu_contiguous(A)
        if A.ndim != 2:
            raise LinAlgError("Matrix rank error")
        if A.dtype != 'float32':
            raise TypeError("only `float32` is supported for now")
        if self.compute_uv:
            return theano.Apply(self, [A],
                                # return S, U, VT
                                [GpuArrayType(A.dtype, broadcastable=[False],
                                              context_name=ctx_name)(),
                                 A.type(),
                                 A.type()])
        else:
            return theano.Apply(self, [A],
                                # return only S
                                [GpuArrayType(A.dtype, broadcastable=[False],
                                              context_name=ctx_name)()])

    def prepare_node(self, node, storage_map, compute_map, impl):
        super(GpuMagmaSVD, self).prepare_node(node, storage_map,
                                              compute_map, impl)
        # Check node to prevent eventual errors with old pickled nodes.
        if self.compute_uv:
            A, B, C = node.outputs
            # We expect order: S (vector), U (matrix), VT (matrix)
            assert A.type.ndim == 1 and B.type.ndim == C.type.ndim == 2, \
                "Due to implementation constraints, GpuMagmaSVD interface has changed and now returns (S, U, VT) " \
                "instead of (U, S, VT). Either update your code, or use gpu_svd() to get the expected (U, S, VT) order."

    def get_params(self, node):
        return self.params_type.get_params(self, context=node.inputs[0].type.context)

    def infer_shape(self, node, shapes):
        x_shape, = shapes
        M, N = x_shape
        K = tensor.minimum(M, N)
        s_shape = (K, )
        if self.compute_uv:
            u_shape = (M, M) if self.full_matrices else (M, K)
            vt_shape = (N, N) if self.full_matrices else (K, N)
            return [s_shape, u_shape, vt_shape]
        else:
            return [s_shape]


def gpu_svd(a, full_matrices=1, compute_uv=1):
    """
    This function performs the SVD on GPU.
    Parameters
    ----------
    full_matrices : bool, optional
        If True (default), u and v have the shapes (M, M) and (N, N),
        respectively.
        Otherwise, the shapes are (M, K) and (K, N), respectively,
        where K = min(M, N).
    compute_uv : bool, optional
        Whether or not to compute u and v in addition to s.
        True by default.

    Returns
    -------
    U, V, D : matrices
    """
    out = GpuMagmaSVD(full_matrices, compute_uv)(a)
    if compute_uv:
        S, U, VT = out
        out = [U, S, VT]
    return out


class GpuMagmaMatrixInverse(GpuMagmaBase):
    """Computes the inverse of a matrix :math:`A` using magma library.
    """
    __props__ = ('inplace', )
    check_input = False
    params_type = ParamsType(inplace=bool_t, context=gpu_context_type)

    def __init__(self, inplace=False):
        COp.__init__(self, ['c_code/magma_inv.c'], 'APPLY_SPECIFIC(magma_inv)')
        self.inplace = inplace
        if self.inplace:
            self.destroy_map = {0: [0]}

    def clone_inplace(self):
        return self.__class__(inplace=True)

    def make_node(self, A):
        ctx_name = infer_context_name(A)
        A = as_gpuarray_variable(A, ctx_name)
        A = gpu_contiguous(A)
        if A.ndim != 2:
            raise LinAlgError("Matrix rank error")
        if A.dtype != 'float32':
            raise TypeError("only `float32` is supported for now")
        return theano.Apply(self, [A], [A.type()])

    def get_params(self, node):
        return self.params_type.get_params(self, context=node.inputs[0].type.context)

    def infer_shape(self, node, shapes):
        return shapes


def gpu_matrix_inverse(a):
    """
    This function performs the matrix inverse on GPU.

    Returns
    -------
    a_inv: matrix
    """
    return GpuMagmaMatrixInverse()(a)


class GpuMagmaCholesky(GpuMagmaBase, CGpuKernelBase):
    """Computes the cholesky decomposition of a matrix :math:`A` using magma
    library.
    """
    __props__ = ('lower', 'inplace')
    check_input = False
    params_type = ParamsType(lower=bool_t, inplace=bool_t,
                             context=gpu_context_type)

    def __init__(self, lower=True, inplace=False):
        self.lower = lower
        COp.__init__(self, ['c_code/magma_cholesky.c'],
                     'APPLY_SPECIFIC(magma_cholesky)')
        self.inplace = inplace
        if self.inplace:
            self.destroy_map = {0: [0]}

    def clone_inplace(self):
        return self.__class__(lower=self.lower, inplace=True)

    def make_node(self, A):
        ctx_name = infer_context_name(A)
        A = as_gpuarray_variable(A, ctx_name)
        A = gpu_contiguous(A)
        if A.ndim != 2:
            raise LinAlgError("Matrix rank error")
        # if A.dtype != 'float32':  # PMB removed this check
        #     raise TypeError("only `float32` is supported for now")
        return theano.Apply(self, [A], [A.type()])

    def get_params(self, node):
        return self.params_type.get_params(self, context=node.inputs[0].type.context)

    def infer_shape(self, node, shapes):
        return [shapes[0]]

    def L_op(self, inputs, outputs, gradients):
        # Modified from theano/tensor/slinalg.py
        # No handling for on_error = 'nan'
        dz = gradients[0]
        chol_x = outputs[0]

        # this is for nan mode
        #
        # ok = ~tensor.any(tensor.isnan(chol_x))
        # chol_x = tensor.switch(ok, chol_x, 1)
        # dz = tensor.switch(ok, dz, 1)

        # deal with upper triangular by converting to lower triangular
        if not self.lower:
            chol_x = chol_x.T
            dz = dz.T

        def tril_and_halve_diagonal(mtx):
            """Extracts lower triangle of square matrix and halves diagonal."""
            return tensor.tril(mtx) - tensor.diag(tensor.diagonal(mtx) / 2.)
        def conjugate_solve_triangular(outer, inner):
            """Computes L^{-T} P L^{-1} for lower-triangular L."""
            return gpu_solve_upper_triangular(
                outer.T, gpu_solve_upper_triangular(outer.T, inner.T).T)

        s = conjugate_solve_triangular(
            chol_x, tril_and_halve_diagonal(chol_x.T.dot(dz)))

        if self.lower:
            grad = tensor.tril(s + s.T) - tensor.diag(tensor.diagonal(s))
        else:
            grad = tensor.triu(s + s.T) - tensor.diag(tensor.diagonal(s))
        return [grad]


class GpuMagmaQR(GpuMagmaBase, CGpuKernelBase):
    """Computes the qr decomposition of a matrix :math:`A` using magma
    library.

    Parameters
    ----------
    complete : If False, returns only ``R``.

    .. warning::

        Because of implementation constraints, this Op returns outputs
        in order ``R, Q``. Use :func:`theano.gpuarray.linalg.gpu_qr`
        to get them in expected order ``Q, R``.
    """
    __props__ = ('complete', )
    _cop_num_inputs = 1
    _cop_num_outputs = 2
    check_input = False
    params_type = ParamsType(complete=bool_t, context=gpu_context_type)

    def __init__(self, complete=True):
        self.complete = complete
        COp.__init__(self, ['c_code/magma_qr.c'], 'APPLY_SPECIFIC(magma_qr)')

    def make_node(self, A):
        ctx_name = infer_context_name(A)
        A = as_gpuarray_variable(A, ctx_name)
        A = gpu_contiguous(A)
        if A.ndim != 2:
            raise LinAlgError("Matrix rank error")
        if A.dtype != 'float32':
            raise TypeError("only `float32` is supported for now")
        if self.complete:
            return theano.Apply(self, [A],
                                # return R, Q
                                [A.type(), A.type()])
        else:
            return theano.Apply(self, [A],
                                # return R
                                [A.type()])

    def get_params(self, node):
        return self.params_type.get_params(self, context=node.inputs[0].type.context)


def gpu_qr(a, complete=True):
    """
    This function performs the QR on GPU.

    Parameters
    ----------
    complete : bool, optional
        If `False`, returns only r.

    Returns
    -------
    Q, R : matrices
    """
    out = GpuMagmaQR(complete)(a)
    if complete:
        R, Q = out
        out = [Q, R]
    return out


class GpuMagmaEigh(GpuMagmaBase):
    """Computes the eigen decomposition of a symmetric matrix :math:`A` using
    magma library.
    Parameters
    ----------
    UPLO : Specifies whether the calculation is done with the lower triangular
           part of matrix (`L`, default) or the upper triangular part (`U`).
    compute_v : If `True`, computes eigenvalues and eigenvectors (`True`,
                default). If `False`, computes only eigenvalues of matrix.
    """
    __props__ = ('lower', 'compute_v')
    _cop_num_inputs = 1
    _cop_num_outputs = 2
    check_input = False
    params_type = ParamsType(lower=bool_t, compute_v=bool_t,
                             context=gpu_context_type)

    def __init__(self, UPLO='L', compute_v=True):
        assert UPLO in ['L', 'U']
        self.lower = UPLO == 'L'
        self.compute_v = compute_v
        COp.__init__(self, ['c_code/magma_eigh.c'],
                     'APPLY_SPECIFIC(magma_eigh)')

    def make_node(self, A):
        ctx_name = infer_context_name(A)
        A = as_gpuarray_variable(A, ctx_name)
        A = gpu_contiguous(A)
        if A.ndim != 2:
            raise LinAlgError("Matrix rank error")
        if A.dtype != 'float32':
            raise TypeError("only `float32` is supported for now")
        if self.compute_v:
            return theano.Apply(self, [A],
                                # return D, V
                                [GpuArrayType(A.dtype, broadcastable=[False],
                                              context_name=ctx_name)(),
                                 A.type()])
        else:
            return theano.Apply(self, [A],
                                # return D
                                [GpuArrayType(A.dtype, broadcastable=[False],
                                              context_name=ctx_name)()])

    def get_params(self, node):
        return self.params_type.get_params(self, context=node.inputs[0].type.context)
// === attached file: theano/gpuarray/c_code/magma_cholesky.c ===

#section kernels

#kernel tril_kernel : size, size, size, *:
#include "cluda.h"

KERNEL void tril_kernel(const ga_size nthreads, const ga_size ncols,
                        const ga_size a_off, GLOBAL_MEM DTYPE_INPUT_0 *a) {
  a = (GLOBAL_MEM DTYPE_INPUT_0 *)(((char *)a) + a_off);
  // grid stride looping
  for (ga_size index = GID_0 * LDIM_0 + LID_0; index < nthreads;
       index += LDIM_0 * GDIM_0) {
    unsigned int ix = index / ncols;
    unsigned int iy = index % ncols;
    if (ix < iy) {
      a[index] = 0.0;
    }
  }
}

#kernel triu_kernel : size, size, size, *:
#include "cluda.h"

KERNEL void triu_kernel(const ga_size nthreads, const ga_size ncols,
                        const ga_size a_off, GLOBAL_MEM DTYPE_INPUT_0 *a) {
  a = (GLOBAL_MEM DTYPE_INPUT_0 *)(((char *)a) + a_off);
  // grid stride looping
  for (ga_size index = GID_0 * LDIM_0 + LID_0; index < nthreads;
       index += LDIM_0 * GDIM_0) {
    unsigned int ix = index / ncols;
    unsigned int iy = index % ncols;
    if (ix > iy) {
      a[index] = 0.0;
    }
  }
}

#section init_code

setup_ext_cuda();

#section support_code_struct

int APPLY_SPECIFIC(magma_cholesky)(PyGpuArrayObject *A, PyGpuArrayObject **L,
                                   PARAMS_TYPE *params) {
  const size_t *dims;
  size_t N, n2;
  magma_uplo_t ul;
  int res = -1, info;

  if (A->ga.typecode != GA_FLOAT && A->ga.typecode != GA_DOUBLE) {
    PyErr_SetString(PyExc_TypeError,
                    "GpuMagmaCholesky: unsupported data type");
    return -1;
  }

  // This is early to match the exit() in the fail label.
  cuda_enter(params->context->ctx);

  if (!GpuArray_IS_C_CONTIGUOUS(&A->ga)) {
    PyErr_SetString(PyExc_ValueError,
                    "GpuMagmaCholesky: requires data to be C-contiguous");
    goto fail;
  }
  if (PyGpuArray_NDIM(A) != 2) {
    PyErr_SetString(PyExc_ValueError, "GpuMagmaCholesky: matrix rank error");
    goto fail;
  }
  dims = PyGpuArray_DIMS(A);
  if (dims[0] != dims[1]) {
    PyErr_SetString(PyExc_ValueError,
                    "GpuMagmaCholesky: matrix is not square");
    goto fail;
  }

  if (params->inplace) {
    Py_XDECREF(*L);
    *L = A;
    Py_INCREF(*L);
  } else {
    *L = theano_try_copy(*L, A);
    if (*L == NULL) {
      PyErr_SetString(
          PyExc_RuntimeError,
          "GpuMagmaCholesky: failed to allocate memory for the output");
      goto fail;
    }
  }

  // magma matrix cholesky
  N = dims[0];
  n2 = N * N;

  // Magma requires column-major order for the matrix A. Instead of changing
  // matrix order which requires copying data, we can compute cholesky
  // decomposition where we change parameters lower to upper and upper to
  // lower.
  if (params->lower) {
    ul = MagmaUpper;
  } else {
    ul = MagmaLower;
  }

  if (A->ga.typecode == GA_FLOAT) {
    magma_spotrf_gpu(ul, N, (float *)PyGpuArray_DEV_DATA(*L), N, &info);
  } else {
    magma_dpotrf_gpu(ul, N, (double *)PyGpuArray_DEV_DATA(*L), N, &info);
  }
  if (info > 0) {
    PyErr_Format(PyExc_RuntimeError,
                 "GpuMagmaCholesky: the leading minor of order %d is not "
                 "positive definite",
                 info);
    goto fail;
  } else if (info < 0) {
    PyErr_Format(PyExc_RuntimeError,
                 "GpuMagmaCholesky: magma_spotrf_gpu/magma_dpotrf_gpu "
                 "argument %d has an illegal value",
                 -info);
    goto fail;
  }

  if (params->lower) {
    res = tril_kernel_scall(1, &n2, 0, n2, N, (*L)->ga.offset, (*L)->ga.data);
    if (res != GA_NO_ERROR) {
      PyErr_Format(PyExc_RuntimeError, "GpuMagmaCholesky: tril_kernel %s.",
                   GpuKernel_error(&k_tril_kernel, res));
      goto fail;
    }
  } else {
    res = triu_kernel_scall(1, &n2, 0, n2, N, (*L)->ga.offset, (*L)->ga.data);
    if (res != GA_NO_ERROR) {
      PyErr_Format(PyExc_RuntimeError, "GpuMagmaCholesky: triu_kernel %s.",
                   GpuKernel_error(&k_triu_kernel, res));
      goto fail;
    }
  }
  res = 0;
fail:
  cuda_exit(params->context->ctx);
  return res;
}