Hi wonghang, I am trying to develop batched versions of both GpuCholesky and GpuCublasTriangularSolve(). I was confused about what to do in L_op with the extra batch dimension, but after carefully reading the docs again, I now understand it: L_op is just a linearized and transposed version of the Op, so the input of L_op has the same shape and form as the output of the Op, and vice versa. The attached functions are tested and work on batches. Let nb be the batch size, so Cll has shape (nb, n, n), and dl has shape (nb, n). Then, to solve er = Cll * dl for dl, I use:
R = theano.gpuarray.linalg.GpuCholeskyBatch(lower=True)(Cll)
lam1 = theano.gpuarray.linalg.GpuCublasTriangularSolveBatch(lower=True)(R, er)
R_T = R.dimshuffle(0, 2, 1)
dl = theano.gpuarray.linalg.GpuCublasTriangularSolveBatch(lower=False)(R_T, lam1)

This does the classical Cholesky factorization followed by forward- and back-substitution. This code runs in 4.0 seconds. When I use theano.scan over the non-batched versions, it runs in 10.3 seconds, so the batched version is about 2.5 times faster. That makes a big difference to me, as I can now get a result before I go home.
Paul

On Tuesday, February 11, 2020 at 4:22:20 AM UTC+1, Wong Hang wrote:
>
> Hi Paul,
>
> I am not quite sure what you are going to do.
>
> If you want a batch version of Cholesky and TriangularSolve, then you
> will end up with "nb" copies of the gradients. What are you going to do
> with those "nb" copies of gradients?
> AFAIK, theano.grad only accepts scalar input. If you need a Jacobian, you
> can only implement it with theano.scan, and I know theano.scan is
> inefficient.
>
> You may be interested in this thread:
> https://groups.google.com/d/msg/theano-users/Rg8ZIru-pgo/DgvwY57RBwAJ
>
> Best,
> wonghang
>
> Paul Baggenstoss <p.m.ba...@ieee.org> wrote on Monday, 10 February 2020
> at 8:46 PM:
>
>> Hi Wonghang,
>> So I am working toward making the Cholesky problem faster, and by that
>> I include triangular solvers like GpuCublasTriangularSolve(). We
>> typically do the Cholesky decomposition, then solve linear systems
>> involving the Cholesky matrix, which has an upper or lower triangular
>> form.
>>
>> So I have started with GpuCublasTriangularSolve(). It has a lot of
>> overhead from all the GPU array copying and creation, which adds to the
>> overhead of theano.scan, which is necessary to work with batches. So I
>> thought it would be much better to have batched versions of Cholesky()
>> and GpuCublasTriangularSolve(). I have created working versions of these
>> batched routines (attached).
>> It's actually pretty simple: I just index into the data by batch, then
>> call the GPU solver, potrf() for Cholesky and trsv() for the solve. I
>> loop over the batch like this:
>>
>> if theano.config.floatX == 'float32':
>>     wordlen = 4
>> else:
>>     wordlen = 8
>> for ib in range(nb):
>>     trsv(ctx.cublas_handle, uplo, trans, diag, n,
>>          A_ptr + ib*n*n*wordlen, lda, b_ptr + ib*n*m*wordlen, 1)
>>
>> There are a few small gotchas, such as that the GPU routines expect
>> F-ordered data, but to index like I did above, the batches have to be
>> C-ordered. So the data has to be C-ordered by batch, but F-ordered
>> within each batch!
>>
>> The problem I have, where I will need help, is in the gradients.
>> Although I can compute the gradient in L_op correctly (I think),
>> theano.gradient is not happy with the shape of the matrices that L_op
>> returns. This is probably because gradient does not understand that the
>> data is batched.
>>
>> Do you think you can help in this matter? I think this is over my head.
>> Paul
>>
>> --
>> ---
>> You received this message because you are subscribed to the Google Groups "theano-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to theano...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/theano-users/b1336e9d-9601-48e9-9665-58740618a861%40googlegroups.com
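For anyone following along without a GPU stack installed, the algebra of the Cholesky-plus-two-triangular-solves pipeline above can be checked on the CPU with plain NumPy. This is only a sketch: `batched_chol_solve` and the test variables are illustrative names, not part of the attached ops, and `np.linalg.solve` stands in for the two cuBLAS triangular solves.

```python
import numpy as np

def batched_chol_solve(Cll, er):
    # For each batch member: factor Cll[ib] = R @ R.T with R lower
    # triangular, then solve R @ lam1 = er (forward substitution) and
    # R.T @ dl = lam1 (back substitution), mirroring the GPU pipeline.
    nb, n, _ = Cll.shape
    dl = np.empty_like(er)
    for ib in range(nb):
        R = np.linalg.cholesky(Cll[ib])     # lower-triangular factor
        lam1 = np.linalg.solve(R, er[ib])
        dl[ib] = np.linalg.solve(R.T, lam1)
    return dl

# Random SPD batch: M[b] @ M[b].T + n*I is symmetric positive definite.
rng = np.random.RandomState(0)
nb, n = 5, 8
M = rng.randn(nb, n, n)
Cll = np.einsum('bij,bkj->bik', M, M) + n * np.eye(n)
dl_true = rng.randn(nb, n)
er = np.einsum('bij,bj->bi', Cll, dl_true)  # er = Cll * dl

dl = batched_chol_solve(Cll, er)
assert np.allclose(dl, dl_true)
```

The batched GPU ops replace the Python-level `for ib in range(nb)` loop with pointer arithmetic over a contiguous batch, which is where the reported 2.5x speedup over theano.scan comes from.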
from __future__ import absolute_import, division, print_function

import warnings

import pkg_resources
import numpy as np
from numpy.linalg.linalg import LinAlgError

import theano
from theano import Op, config, tensor
from theano.scalar import bool as bool_t
from theano.gof import COp, ParamsType
from theano.gpuarray import GpuArrayType

from .basic_ops import (CGpuKernelBase, as_gpuarray_variable, gpu_contiguous,
                        gpuarray_helper_inc_dir, infer_context_name)
from .type import gpu_context_type

try:
    import pygpu
    from pygpu.basic import triu, tril
    pygpu_available = True
except ImportError:
    pygpu_available = False

cusolver_available = False
try:
    import skcuda
    from skcuda import cusolver
    cusolver_available = True
except (ImportError, OSError, RuntimeError,
        pkg_resources.DistributionNotFound):
    pass

cublas_available = False
try:
    from skcuda import cublas
    cublas_available = True
except (ImportError, OSError, RuntimeError,
        pkg_resources.DistributionNotFound):
    pass

if cusolver_available:
    # Add cusolver calls that are missing in skcuda.
    # SPOTRS
    cusolver._libcusolver.cusolverDnSpotrs.restype = int
    cusolver._libcusolver.cusolverDnSpotrs.argtypes = [cusolver.ctypes.c_void_p,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_void_p,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_void_p,
                                                       cusolver.ctypes.c_int,
                                                       cusolver.ctypes.c_void_p]

    def cusolverDnSpotrs(handle, uplo, n, nrhs, A, lda, B, ldb, devInfo):
        """
        Solve real single precision linear system for hermitian matrices.

        References
        ----------
        `cusolverDn<t>potrs <http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-potrs>`_
        """
        status = cusolver._libcusolver.cusolverDnSpotrs(handle, uplo, n, nrhs,
                                                        int(A), lda, int(B),
                                                        ldb, int(devInfo))
        cusolver.cusolverCheckStatus(status)

    # DPOTRS
    # TODO: Is it still missing in skcuda?
cusolver._libcusolver.cusolverDnDpotrs.restype = int cusolver._libcusolver.cusolverDnDpotrs.argtypes = [cusolver.ctypes.c_void_p, cusolver.ctypes.c_int, cusolver.ctypes.c_int, cusolver.ctypes.c_int, cusolver.ctypes.c_void_p, cusolver.ctypes.c_int, cusolver.ctypes.c_void_p, cusolver.ctypes.c_int, cusolver.ctypes.c_void_p] def cusolverDnDpotrs(handle, uplo, n, nrhs, A, lda, B, ldb, devInfo): status = cusolver._libcusolver.cusolverDnDpotrs(handle, uplo, n, nrhs, int(A), lda, int(B), ldb, int(devInfo)) cusolver.cusolverCheckStatus(status) def attach_cusolver_handle_to_context(ctx): handle = getattr(ctx, 'cusolver_handle', None) if handle is None: with ctx: ctx.cusolver_handle = cusolver.cusolverDnCreate() def attach_cublas_handle_to_context(ctx): handle = getattr(ctx, 'cublas_handle', None) if handle is None: with ctx: ctx.cublas_handle = cublas.cublasCreate() # it is a subset of all cases available in slinalg's MATRIX_STRUCTURE MATRIX_STRUCTURES_SOLVE = ( 'general', 'symmetric', 'lower_triangular', 'upper_triangular') class GpuCusolverSolve(Op): """ CUSOLVER GPU solver OP. Parameters ---------- trans Whether to take the transpose of the input matrix or not. 
""" __props__ = ('A_structure', 'trans', 'inplace') def __init__(self, A_structure='general', trans='N', inplace=False): self.trans = trans self.inplace = inplace self.A_structure = A_structure if self.inplace: self.destroy_map = {0: [0]} assert A_structure in MATRIX_STRUCTURES_SOLVE super(GpuCusolverSolve, self).__init__() def make_node(self, inp1, inp2): if not cusolver_available: raise RuntimeError('CUSOLVER is not available and ' 'GpuCusolverSolve Op can not be constructed.') if skcuda.__version__ <= '0.5.1': warnings.warn('The GpuSolve op requires scikit-cuda > 0.5.1 to work with CUDA 8') context_name = infer_context_name(inp1, inp2) inp1 = as_gpuarray_variable(inp1, context_name) inp2 = as_gpuarray_variable(inp2, context_name) inp1 = gpu_contiguous(inp1) inp2 = gpu_contiguous(inp2) assert inp1.ndim == 2 assert inp2.ndim == 2 assert inp1.dtype == inp2.dtype return theano.Apply( self, [inp1, inp2], [GpuArrayType(inp1.dtype, broadcastable=inp1.broadcastable, context_name=context_name)()]) def prepare_node(self, node, storage_map, compute_map, impl): ctx = node.inputs[0].type.context attach_cusolver_handle_to_context(ctx) def check_dev_info(self, dev_info): val = np.asarray(dev_info)[0] if val > 0: raise LinAlgError('A is singular') def perform(self, node, inputs, outputs): context = inputs[0][0].context # Size of the matrices to invert. z = outputs[0] # Matrix. A = inputs[0] # Solution vectors. 
b = inputs[1] assert(len(A.shape) == 2) assert(len(b.shape) == 2) if self.trans in ['T', 'C']: trans = 1 l, n = A.shape k, m = b.shape elif self.trans == 'N': trans = 0 n, l = A.shape k, m = b.shape else: raise ValueError('Invalid value for trans') if l != n: raise ValueError('A must be a square matrix') if n != k: raise ValueError('A and b must be aligned.') lda = max(1, n) ldb = max(1, k) # We copy A and b as cusolver operates inplace b = pygpu.array(b, copy=True, order='F') if not self.inplace: A = pygpu.array(A, copy=True) A_ptr = A.gpudata b_ptr = b.gpudata # cusolver expects a F ordered matrix, but A is not explicitly # converted between C and F order, instead we switch the # "transpose" flag. if A.flags['C_CONTIGUOUS']: trans = 1 - trans if A.dtype == 'float32': potrf_bufferSize = cusolver.cusolverDnSpotrf_bufferSize potrf = cusolver.cusolverDnSpotrf potrs = cusolverDnSpotrs getrf_bufferSize = cusolver.cusolverDnSgetrf_bufferSize getrf = cusolver.cusolverDnSgetrf getrs = cusolver.cusolverDnSgetrs elif A.dtype == 'float64': potrf_bufferSize = cusolver.cusolverDnDpotrf_bufferSize potrf = cusolver.cusolverDnDpotrf potrs = cusolverDnDpotrs getrf_bufferSize = cusolver.cusolverDnDgetrf_bufferSize getrf = cusolver.cusolverDnDgetrf getrs = cusolver.cusolverDnDgetrs else: raise ValueError("Unsupported dtype") if self.A_structure == 'symmetric': with context: workspace_size = potrf_bufferSize( context.cusolver_handle, 0, n, A_ptr, lda) workspace = pygpu.zeros(workspace_size, dtype=A.dtype, context=context) dev_info = pygpu.zeros((1,), dtype='int32', context=context) workspace_ptr = workspace.gpudata dev_info_ptr = dev_info.gpudata with context: potrf( context.cusolver_handle, 0, n, A_ptr, lda, workspace_ptr, workspace_size, dev_info_ptr) self.check_dev_info(dev_info) potrs( context.cusolver_handle, 0, n, m, A_ptr, lda, b_ptr, ldb, dev_info_ptr) else: # general case for A with context: workspace_size = getrf_bufferSize( context.cusolver_handle, n, n, A_ptr, lda) 
workspace = pygpu.zeros(workspace_size, dtype=A.dtype, context=context) pivots = pygpu.zeros(n, dtype='int32', context=context) dev_info = pygpu.zeros((1,), dtype='int32', context=context) workspace_ptr = workspace.gpudata pivots_ptr = pivots.gpudata dev_info_ptr = dev_info.gpudata with context: getrf( context.cusolver_handle, n, n, A_ptr, lda, workspace_ptr, pivots_ptr, dev_info_ptr) self.check_dev_info(dev_info) getrs( context.cusolver_handle, trans, n, m, A_ptr, lda, pivots_ptr, b_ptr, ldb, dev_info_ptr) z[0] = b def L_op(self, inputs, outputs, output_gradients): # Modified from theano/tensor/slinalg.py A, b = inputs c = outputs[0] c_bar = output_gradients[0] # FIXME: triangular structure would use GpuCublasTriangularsolve? # no need to handle A_structure like slinalg.py? trans_solve_op = GpuCusolverSolve('general') b_bar = trans_solve_op(A.T, c_bar) A_bar = -tensor.outer(b_bar, c) if c.ndim == 1 else -b_bar.dot(c.T) return [A_bar, b_bar] class GpuCublasTriangularSolve(Op): """ CUBLAS GPU Triangular Solve Op. Parameters ---------- lower Whether system is lower-triangular (True) or upper-triangular (False). trans Whether to take the transpose of the input matrix or not. 
""" __props__ = ('trans', 'lower') def __init__(self, lower=True, trans='N'): self.trans = trans self.lower = lower super(GpuCublasTriangularSolve, self).__init__() def make_node(self, inp1, inp2): if not cublas_available: raise RuntimeError('CUBLAS is not available and ' 'GpuCublasTriangularSolve Op ' 'can not be constructed.') context_name = infer_context_name(inp1, inp2) inp1 = as_gpuarray_variable(inp1, context_name) inp2 = as_gpuarray_variable(inp2, context_name) inp1 = gpu_contiguous(inp1) inp2 = gpu_contiguous(inp2) assert inp1.ndim == 2 assert inp2.ndim in [1, 2] assert inp1.dtype == inp2.dtype return theano.Apply(self, [inp1, inp2], [GpuArrayType(inp1.dtype, broadcastable=inp2.broadcastable, context_name=context_name)()]) def prepare_node(self, node, storage_map, compute_map, impl): ctx = node.inputs[0].type.context attach_cublas_handle_to_context(ctx) def perform(self, node, inputs, outputs): ctx = node.inputs[0].type.context # Solution set x = outputs[0] # Matrix. A = inputs[0] # right hand side b = inputs[1] assert(len(A.shape) == 2) assert(len(b.shape) in [1, 2]) # implicitly deal with the difference between C order # and fortran order by flipping the trans and lower flags lower = not self.lower trans = self.trans if trans in ['T', 'C']: trans = 'N' l, n = A.shape elif trans == 'N': trans = 'T' n, l = A.shape else: raise ValueError('Invalid value for trans') if b.ndim == 2: k, m = b.shape else: k, = b.shape m = 1 if l != n: raise ValueError('A must be a square matrix') if n != k: raise ValueError('A and b must be aligned.') lda = max(1, n) ldb = max(1, k) # solution overwrites right hand side on exit b = pygpu.array(b, copy=True, order='F') A_ptr = A.gpudata b_ptr = b.gpudata # unit scalar used for multiplication alpha = 1.0 # indicates matrix A is on left of B side = 'l' # set whether upper or lower part of matrix A stored uplo = 'l' if lower else 'u' # indicates elements on diagonal of matrix A may not be unity diag = 'n' if A.dtype == 'float32': 
trsv = cublas.cublasStrsv trsm = cublas.cublasStrsm elif A.dtype == 'float64': trsv = cublas.cublasDtrsv trsm = cublas.cublasDtrsm else: raise ValueError("Unsupported dtype") with ctx: if b.ndim == 1: # matrix vector solve trsv(ctx.cublas_handle, uplo, trans, diag, n, A_ptr, lda, b_ptr, 1) else: trsm(ctx.cublas_handle, side, uplo, trans, diag, n, m, alpha, A_ptr, lda, b_ptr, ldb) x[0] = b def L_op(self, inputs, outputs, output_gradients): # Modified from theano/tensor/slinalg.py A, b = inputs c = outputs[0] c_bar = output_gradients[0] trans_solve_op = GpuCublasTriangularSolve(not self.lower) b_bar = trans_solve_op(A.T, c_bar) A_bar = -tensor.outer(b_bar, c) if c.ndim == 1 else -b_bar.dot(c.T) if self.lower: A_bar = tensor.tril(A_bar) else: A_bar = tensor.triu(A_bar) return [A_bar, b_bar] def gpu_solve(A, b, A_structure='general', trans='N'): if A_structure == 'lower': return GpuCublasTriangularSolve(True, trans)(A, b) elif A_structure == 'upper': return GpuCublasTriangularSolve(False, trans)(A, b) return GpuCusolverSolve(A_structure, trans)(A, b) def gpu_solve_lower_triangular(A, b, trans='N'): return GpuCublasTriangularSolve(True, trans)(A, b) def gpu_solve_upper_triangular(A, b, trans='N'): return GpuCublasTriangularSolve(False, trans)(A, b) class GpuCholesky(Op): """ CUSOLVER GPU Cholesky Op. Given a real positive definite matrix `A` returns either a lower triangular matrix `L` such that `A == dot(L, L.T)` if `lower == True` else returns an upper triangular matrix `U` such that `A == dot(U.T, U)` if `lower == False`. Parameters ---------- lower Whether to return a lower rather than upper triangular decomposition. 
""" __props__ = ('lower', 'inplace') def __init__(self, lower=True, inplace=False): self.lower = lower self.inplace = inplace if self.inplace: self.destroy_map = {0: [0]} super(GpuCholesky, self).__init__() def clone_inplace(self): return self.__class__(lower=self.lower, inplace=True) def make_node(self, inp): if not cusolver_available: raise RuntimeError('CUSOLVER is not available and ' 'GpuCholesky Op can not be constructed.') if skcuda.__version__ <= '0.5.1': warnings.warn('The GpuCholesky op requires scikit-cuda > ' '0.5.1 to work with CUDA 8') if not pygpu_available: raise RuntimeError('Missing pygpu or triu/tril functions.' 'Install or update libgpuarray.') context_name = infer_context_name(inp) inp = as_gpuarray_variable(inp, context_name) inp = gpu_contiguous(inp) assert inp.ndim == 2 return theano.Apply(self, [inp], [inp.type()]) def prepare_node(self, node, storage_map, compute_map, impl): ctx = node.inputs[0].type.context attach_cusolver_handle_to_context(ctx) def perform(self, node, inputs, outputs): context = inputs[0][0].context # Input matrix. A = inputs[0] l, n = A.shape if l != n: raise ValueError('A must be a square matrix') lda = max(1, n) # cusolver operates on F ordered matrices, but A is expected # to be symmetric so it does not matter. # We copy A if needed if self.inplace: L = A else: L = pygpu.array(A, copy=True) # The output matrix will contain only the upper or lower # triangular factorization of A. 
If L is C ordered (it # probably is as it is the default in Theano) we just switch # the fill mode parameter of cusolver l_parameter = 0 if self.lower else 1 if L.flags['C_CONTIGUOUS']: l_parameter = 1 - l_parameter L_ptr = L.gpudata if A.dtype == 'float32': potrf_bufferSize = cusolver.cusolverDnSpotrf_bufferSize potrf = cusolver.cusolverDnSpotrf elif A.dtype == 'float64': potrf_bufferSize = cusolver.cusolverDnDpotrf_bufferSize potrf = cusolver.cusolverDnDpotrf else: raise ValueError("Unsupported dtype") with context: workspace_size = potrf_bufferSize( context.cusolver_handle, l_parameter, n, L_ptr, lda) workspace = pygpu.zeros(workspace_size, dtype=A.dtype, context=context) dev_info = pygpu.zeros((1,), dtype='int32', context=context) workspace_ptr = workspace.gpudata dev_info_ptr = dev_info.gpudata potrf(context.cusolver_handle, l_parameter, n, L_ptr, lda, workspace_ptr, workspace_size, dev_info_ptr) val_dev_info = np.asarray(dev_info)[0] if val_dev_info > 0: raise LinAlgError('Cholesky decomposition failed (is A SPD?)') # cusolver leaves the elements in the matrix outside the considered # upper or lower triangle unchanged, so we need to put zeros outside # the triangle if self.lower: tril(L) else: triu(L) outputs[0][0] = L def L_op(self, inputs, outputs, gradients): # Modified from theano/tensor/slinalg.py # No handling for on_error = 'nan' dz = gradients[0] chol_x = outputs[0] # this is for nan mode # # ok = ~tensor.any(tensor.isnan(chol_x)) # chol_x = tensor.switch(ok, chol_x, 1) # dz = tensor.switch(ok, dz, 1) # deal with upper triangular by converting to lower triangular if not self.lower: chol_x = chol_x.T dz = dz.T def tril_and_halve_diagonal(mtx): """Extracts lower triangle of square matrix and halves diagonal.""" return tensor.tril(mtx) - tensor.diag(tensor.diagonal(mtx) / 2.) 
def conjugate_solve_triangular(outer, inner): """Computes L^{-T} P L^{-1} for lower-triangular L.""" return gpu_solve_upper_triangular( outer.T, gpu_solve_upper_triangular(outer.T, inner.T).T) s = conjugate_solve_triangular( chol_x, tril_and_halve_diagonal(chol_x.T.dot(dz))) if self.lower: grad = tensor.tril(s + s.T) - tensor.diag(tensor.diagonal(s)) else: grad = tensor.triu(s + s.T) - tensor.diag(tensor.diagonal(s)) return [grad] def gpu_cholesky(A, lower=True): return GpuCholesky(lower)(A) # TODO: add support for float64 class GpuMagmaBase(COp): """Base class for magma related operations. Add the necessary headers, libraries and optionally the location of headers and library. """ def c_headers(self): return ['gpuarray/types.h', 'gpuarray/array.h', 'gpuarray/ext_cuda.h', 'gpuarray_helper.h', 'magma.h'] def c_header_dirs(self): dirs = [gpuarray_helper_inc_dir(), pygpu.get_include(), config.cuda.include_path] if config.magma.include_path: dirs.append(config.magma.include_path) return dirs def c_libraries(self): return ['magma'] def c_lib_dirs(self): if config.magma.library_path: return [config.magma.library_path] return [] def prepare_node(self, node, storage_map, compute_map, impl): from skcuda.magma import magma_init ctx = node.inputs[0].type.context if not getattr(ctx, 'is_magma_initialized', False): with ctx: magma_init() ctx.is_magma_initialized = True class GpuMagmaSVD(GpuMagmaBase): """Computes the svd of a matrix :math:`A` using magma library. .. warning:: Because of implementation constraints, this Op returns outputs in order ``S, U, VT``. Use :func:`theano.gpuarray.linalg.gpu_svd` to get them in expected order ``U, S, VT``. 
""" __props__ = ('full_matrices', 'compute_uv') _cop_num_inputs = 1 _cop_num_outputs = 3 check_input = False params_type = ParamsType(full_matrices=bool_t, context=gpu_context_type) def __init__(self, full_matrices=True, compute_uv=True): self.full_matrices = full_matrices self.compute_uv = compute_uv COp.__init__(self, ['c_code/magma_svd.c'], 'APPLY_SPECIFIC(magma_svd)') def make_node(self, A): ctx_name = infer_context_name(A) A = as_gpuarray_variable(A, ctx_name) A = gpu_contiguous(A) if A.ndim != 2: raise LinAlgError("Matrix rank error") if A.dtype != 'float32': raise TypeError("only `float32` is supported for now") if self.compute_uv: return theano.Apply(self, [A], # return S, U, VT [GpuArrayType(A.dtype, broadcastable=[False], context_name=ctx_name)(), A.type(), A.type()]) else: return theano.Apply(self, [A], # return only S [GpuArrayType(A.dtype, broadcastable=[False], context_name=ctx_name)()]) def prepare_node(self, node, storage_map, compute_map, impl): super(GpuMagmaSVD, self).prepare_node(node, storage_map, compute_map, impl) # Check node to prevent eventual errors with old pickled nodes. if self.compute_uv: A, B, C = node.outputs # We expect order: S (vector), U (matrix), VT (matrix) assert A.type.ndim == 1 and B.type.ndim == C.type.ndim == 2, \ "Due to implementation constraints, GpuMagmaSVD interface has changed and now returns (S, U, VT) " \ "instead of (U, S, VT). Either update your code, or use gpu_svd() to get the expected (U, S, VT) order." def get_params(self, node): return self.params_type.get_params(self, context=node.inputs[0].type.context) def infer_shape(self, node, shapes): x_shape, = shapes M, N = x_shape K = tensor.minimum(M, N) s_shape = (K, ) if self.compute_uv: u_shape = (M, M) if self.full_matrices else (M, K) vt_shape = (N, N) if self.full_matrices else (K, N) return [s_shape, u_shape, vt_shape] else: return [s_shape] def gpu_svd(a, full_matrices=1, compute_uv=1): """ This function performs the SVD on GPU. 
Parameters ---------- full_matrices : bool, optional If True (default), u and v have the shapes (M, M) and (N, N), respectively. Otherwise, the shapes are (M, K) and (K, N), respectively, where K = min(M, N). compute_uv : bool, optional Whether or not to compute u and v in addition to s. True by default. Returns ------- U, V, D : matrices """ out = GpuMagmaSVD(full_matrices, compute_uv)(a) if compute_uv: S, U, VT = out out = [U, S, VT] return out class GpuMagmaMatrixInverse(GpuMagmaBase): """Computes the inverse of a matrix :math:`A` using magma library. """ __props__ = ('inplace', ) check_input = False params_type = ParamsType(inplace=bool_t, context=gpu_context_type) def __init__(self, inplace=False): COp.__init__(self, ['c_code/magma_inv.c'], 'APPLY_SPECIFIC(magma_inv)') self.inplace = inplace if self.inplace: self.destroy_map = {0: [0]} def clone_inplace(self): return self.__class__(inplace=True) def make_node(self, A): ctx_name = infer_context_name(A) A = as_gpuarray_variable(A, ctx_name) A = gpu_contiguous(A) if A.ndim != 2: raise LinAlgError("Matrix rank error") if A.dtype != 'float32': raise TypeError("only `float32` is supported for now") return theano.Apply(self, [A], [A.type()]) def get_params(self, node): return self.params_type.get_params(self, context=node.inputs[0].type.context) def infer_shape(self, node, shapes): return shapes def gpu_matrix_inverse(a): """ This function performs the matrix inverse on GPU. Returns ------- a_inv: matrix """ return GpuMagmaMatrixInverse()(a) class GpuMagmaCholesky(GpuMagmaBase, CGpuKernelBase): """Computes the cholesky decomposition of a matrix :math:`A` using magma library. 
""" __props__ = ('lower', 'inplace') check_input = False params_type = ParamsType(lower=bool_t, inplace=bool_t, context=gpu_context_type) def __init__(self, lower=True, inplace=False): self.lower = lower COp.__init__(self, ['c_code/magma_cholesky.c'], 'APPLY_SPECIFIC(magma_cholesky)') self.inplace = inplace if self.inplace: self.destroy_map = {0: [0]} def clone_inplace(self): return self.__class__(lower=self.lower, inplace=True) def make_node(self, A): ctx_name = infer_context_name(A) A = as_gpuarray_variable(A, ctx_name) A = gpu_contiguous(A) if A.ndim != 2: raise LinAlgError("Matrix rank error") #if A.dtype != 'float32': # PMB removed this check # raise TypeError("only `float32` is supported for now") return theano.Apply(self, [A], [A.type()]) def get_params(self, node): return self.params_type.get_params(self, context=node.inputs[0].type.context) def infer_shape(self, node, shapes): return [shapes[0]] # Note to self - L_op is just the linearized and transposed Op. # trivial case when Op is y = A * x, then L_op is x_bar = A.T * y_bar # y_bar = "output_gradients" and is same shape and form as y, # x_bar = "input_gradients" (output of L_op) and is same shape and form as x, def L_op(self, inputs, outputs, output_gradients): # Modified from theano/tensor/slinalg.py # No handling for on_error = 'nan' dz = output_gradients[0] nb,n,l = dz.shape chol_x = outputs[0] # deal with upper triangular by converting to lower triangular if not self.lower: chol_x=chol_x.dimshuffle(0,2,1) dz=dz.dimshuffle(0,2,1) def tril_and_halve_diagonal(mtx): """Extracts lower triangle of square matrix and halves diagonal.""" return tensor.tril(mtx) - tensor.diag(tensor.diagonal(mtx) / 2.) 
        def tril_and_halve_diagonal_batch(mtx):
            """Extracts lower triangles of a batch of square matrices and
            halves their diagonals."""
            T = tensor.tril(tensor.ones((n, n)))
            I = tensor.eye(n)
            return mtx * T - I * mtx / 2

        def conjugate_solve_triangular(outer, inner):
            """Computes L^{-T} P L^{-1} for lower-triangular L."""
            return gpu_solve_upper_triangular(
                outer.T, gpu_solve_upper_triangular(outer.T, inner.T).T)

        def conjugate_solve_triangular_batch(outer, inner):
            """Computes L^{-T} P L^{-1} for a batch of lower-triangular L."""
            if True:
                out, _ = theano.scan(
                    fn=lambda outer, inner: gpu_solve_upper_triangular(
                        outer.T,
                        gpu_solve_upper_triangular(outer.T, inner.T).T),
                    sequences=[outer, inner], non_sequences=[])
                return out
            else:
                outer_T = outer.dimshuffle(0, 2, 1)
                inner_T = inner.dimshuffle(0, 2, 1)
                # Important note: gpu_solve_upper_triangular_batch works for
                # a matrix RHS, but the matrix RHS is the transposed RHS!
                # For this reason, we don't need to transpose 'inner'.
                tmp = gpu_solve_upper_triangular_batch(outer_T, inner_T)
                tmp_T = tmp.dimshuffle(0, 2, 1)
                return gpu_solve_upper_triangular_batch(outer_T, tmp_T)

        if False:
            # batched equivalent of: tmp1 = chol_x.T.dot(dz)
            tmp1 = tensor.sum(
                chol_x.dimshuffle(0, 2, 1, 'x') * dz.dimshuffle(0, 'x', 1, 2),
                axis=2).reshape((nb, n, n))
            tmp2 = tril_and_halve_diagonal_batch(tmp1)
            s = conjugate_solve_triangular_batch(chol_x, tmp2)
        else:
            s, _ = theano.scan(
                fn=lambda chol_x, dz: conjugate_solve_triangular(
                    chol_x, tril_and_halve_diagonal(chol_x.T.dot(dz))),
                sequences=[chol_x, dz], non_sequences=[])

        if True:
            if self.lower:
                grad, _ = theano.scan(
                    fn=lambda s: tensor.tril(s + s.T) - tensor.diag(tensor.diagonal(s)),
                    sequences=[s], non_sequences=[])
            else:
                grad, _ = theano.scan(
                    fn=lambda s: tensor.triu(s + s.T) - tensor.diag(tensor.diagonal(s)),
                    sequences=[s], non_sequences=[])
        else:
            if self.lower:
                T = tensor.tril(tensor.ones((n, n)))
            else:
                T = tensor.triu(tensor.ones((n, n)))
            I = tensor.eye(n)
            grad = (s + s.dimshuffle(0, 2, 1)) * T - s * I.dimshuffle('x', 0, 1)

        return [grad]

    def L_op_(self, inputs, outputs, gradients):
        # Modified from
theano/tensor/slinalg.py # No handling for on_error = 'nan' dz = gradients[0] chol_x = outputs[0] # this is for nan mode # # ok = ~tensor.any(tensor.isnan(chol_x)) # chol_x = tensor.switch(ok, chol_x, 1) # dz = tensor.switch(ok, dz, 1) # deal with upper triangular by converting to lower triangular if not self.lower: chol_x = chol_x.T dz = dz.T def tril_and_halve_diagonal(mtx): """Extracts lower triangle of square matrix and halves diagonal.""" return tensor.tril(mtx) - tensor.diag(tensor.diagonal(mtx) / 2.) def conjugate_solve_triangular(outer, inner): """Computes L^{-T} P L^{-1} for lower-triangular L.""" return gpu_solve_upper_triangular( outer.T, gpu_solve_upper_triangular(outer.T, inner.T).T) s = conjugate_solve_triangular( chol_x, tril_and_halve_diagonal(chol_x.T.dot(dz))) if self.lower: grad = tensor.tril(s + s.T) - tensor.diag(tensor.diagonal(s)) else: grad = tensor.triu(s + s.T) - tensor.diag(tensor.diagonal(s)) return [grad] class GpuMagmaQR(GpuMagmaBase, CGpuKernelBase): """Computes the qr decomposition of a matrix :math:`A` using magma library. Parameters ---------- complete : If False, returns only ``R``. .. warning:: Because of implementation constraints, this Op returns outputs in order ``R, Q``. Use :func:`theano.gpuarray.linalg.gpu_qr` to get them in expected order ``Q, R``. 
""" __props__ = ('complete', ) _cop_num_inputs = 1 _cop_num_outputs = 2 check_input = False params_type = ParamsType(complete=bool_t, context=gpu_context_type) def __init__(self, complete=True): self.complete = complete COp.__init__(self, ['c_code/magma_qr.c'], 'APPLY_SPECIFIC(magma_qr)') def make_node(self, A): ctx_name = infer_context_name(A) A = as_gpuarray_variable(A, ctx_name) A = gpu_contiguous(A) if A.ndim != 2: raise LinAlgError("Matrix rank error") if A.dtype != 'float32': raise TypeError("only `float32` is supported for now") if self.complete: return theano.Apply(self, [A], # return R, Q [A.type(), A.type()]) else: return theano.Apply(self, [A], # return R [A.type()]) def get_params(self, node): return self.params_type.get_params(self, context=node.inputs[0].type.context) def gpu_qr(a, complete=True): """ This function performs the QR on GPU. Parameters ---------- complete : bool, optional If `False`, returns only r. Returns ------- Q, R : matrices """ out = GpuMagmaQR(complete)(a) if complete: R, Q = out out = [Q, R] return out class GpuMagmaEigh(GpuMagmaBase): """Computes the eigen decomposition of a symmetric matrix :math:`A` using magma library. Parameters ---------- UPLO : Specifies whether the calculation is done with the lower triangular part of matrix (`L`, default) or the upper triangular part (`U`). compute_v : If `True`, computes eigenvalues and eigenvectors (`True`, default). If `False`, computes only eigenvalues of matrix. 
""" __props__ = ('lower', 'compute_v') _cop_num_inputs = 1 _cop_num_outputs = 2 check_input = False params_type = ParamsType(lower=bool_t, compute_v=bool_t, context=gpu_context_type) def __init__(self, UPLO='L', compute_v=True): assert UPLO in ['L', 'U'] self.lower = UPLO == 'L' self.compute_v = compute_v COp.__init__(self, ['c_code/magma_eigh.c'], 'APPLY_SPECIFIC(magma_eigh)') def make_node(self, A): ctx_name = infer_context_name(A) A = as_gpuarray_variable(A, ctx_name) A = gpu_contiguous(A) if A.ndim != 2: raise LinAlgError("Matrix rank error") if A.dtype != 'float32': raise TypeError("only `float32` is supported for now") if self.compute_v: return theano.Apply(self, [A], # return D, V [GpuArrayType(A.dtype, broadcastable=[False], context_name=ctx_name)(), A.type()]) else: return theano.Apply(self, [A], # return D [GpuArrayType(A.dtype, broadcastable=[False], context_name=ctx_name)()]) def get_params(self, node): return self.params_type.get_params(self, context=node.inputs[0].type.context) class GpuCholeskyBatch(Op): """ CUSOLVER GPU Cholesky Op. Given a real positive definite matrix `A` returns either a lower triangular matrix `L` such that `A == dot(L, L.T)` if `lower == True` else returns an upper triangular matrix `U` such that `A == dot(U.T, U)` if `lower == False`. Parameters ---------- lower Whether to return a lower rather than upper triangular decomposition. 
""" __props__ = ('lower', 'inplace') def __init__(self, lower=True, inplace=False): self.lower = lower self.inplace = inplace if self.inplace: self.destroy_map = {0: [0]} super(GpuCholeskyBatch, self).__init__() def clone_inplace(self): return self.__class__(lower=self.lower, inplace=True) def make_node(self, inp): if not cusolver_available: raise RuntimeError('CUSOLVER is not available and ' 'GpuCholesky Op can not be constructed.') if skcuda.__version__ <= '0.5.1': warnings.warn('The GpuCholesky op requires scikit-cuda > ' '0.5.1 to work with CUDA 8') if not pygpu_available: raise RuntimeError('Missing pygpu or triu/tril functions.' 'Install or update libgpuarray.') context_name = infer_context_name(inp) inp = as_gpuarray_variable(inp, context_name) inp = gpu_contiguous(inp) assert inp.ndim == 3 return theano.Apply(self, [inp], [inp.type()]) def prepare_node(self, node, storage_map, compute_map, impl): ctx = node.inputs[0].type.context attach_cusolver_handle_to_context(ctx) def perform(self, node, inputs, outputs): context = inputs[0][0].context # Input matrix. A = inputs[0] nb, l, n = A.shape if l != n: raise ValueError('A must be a square matrix') lda = max(1, n) # cusolver operates on F ordered matrices, but A is expected # to be symmetric so it does not matter. # We copy A if needed if self.inplace: L = A else: L = pygpu.array(A, copy=True) # The output matrix will contain only the upper or lower # triangular factorization of A. 
        # If L is C ordered (it probably is, as it is the default in Theano)
        # we just switch the fill mode parameter of cusolver.
        l_parameter = 0 if self.lower else 1
        if L.flags['C_CONTIGUOUS']:
            l_parameter = 1 - l_parameter

        L_ptr = L.gpudata

        if A.dtype == 'float32':
            potrf_bufferSize = cusolver.cusolverDnSpotrf_bufferSize
            potrf = cusolver.cusolverDnSpotrf
            wordlen = 4
        elif A.dtype == 'float64':
            potrf_bufferSize = cusolver.cusolverDnDpotrf_bufferSize
            potrf = cusolver.cusolverDnDpotrf
            wordlen = 8
        else:
            raise ValueError("Unsupported dtype")

        with context:
            workspace_size = potrf_bufferSize(
                context.cusolver_handle, l_parameter, n, L_ptr, lda)

            workspace = pygpu.zeros(workspace_size, dtype=A.dtype,
                                    context=context)

            dev_info = pygpu.zeros((1,), dtype='int32', context=context)

            workspace_ptr = workspace.gpudata
            dev_info_ptr = dev_info.gpudata

            for ibatch in range(nb):
                # device pointer offsets are in bytes, hence the wordlen factor
                L_ptr_batch = L.gpudata + ibatch * n * n * wordlen
                potrf(context.cusolver_handle, l_parameter, n, L_ptr_batch,
                      lda, workspace_ptr, workspace_size, dev_info_ptr)

                val_dev_info = np.asarray(dev_info)[0]
                if val_dev_info > 0:
                    raise LinAlgError('Cholesky decomposition failed (is A SPD?)')

                # cusolver leaves the elements in the matrix outside the
                # considered upper or lower triangle unchanged, so we need
                # to put zeros outside the triangle.
                if self.lower:
                    tril(L[ibatch])
                else:
                    triu(L[ibatch])

        outputs[0][0] = L

    # Note to self - L_op is just the linearized and transposed Op.
    # Trivial case: when the Op is y = A * x, then L_op is x_bar = A.T * y_bar.
    # y_bar = "output_gradients" and has the same shape and form as y;
    # x_bar = "input_gradients" (the output of L_op) and has the same shape
    # and form as x.
    def L_op(self, inputs, outputs, output_gradients):
        # Modified from theano/tensor/slinalg.py
        # No handling for on_error = 'nan'
        dz = output_gradients[0]
        nb, n, l = dz.shape
        chol_x = outputs[0]

        # deal with upper triangular by converting to lower triangular
        if not self.lower:
            chol_x = chol_x.dimshuffle(0, 2, 1)
            dz = dz.dimshuffle(0, 2, 1)

        def tril_and_halve_diagonal_batch(mtx):
            """Extracts lower triangle of square matrix and halves diagonal."""
            T = tensor.tril(tensor.ones((n, n)))
            I = tensor.eye(n)
            out = mtx * T - I * mtx / 2
            return out

        def conjugate_solve_triangular_batch(outer, inner):
            """Computes L^{-T} P L^{-1} for lower-triangular L."""
            outer_T = outer.dimshuffle(0, 2, 1)
            # gpu_solve_upper_triangular_batch works for a matrix RHS,
            # but the matrix RHS should be transposed due to using C
            # ordering. For this reason we don't need to transpose 'tmp'
            # or 'inner', even though it was done in slinalg.Cholesky.L_op.
            tmp = gpu_solve_upper_triangular_batch(outer_T, inner)
            return gpu_solve_upper_triangular_batch(outer_T, tmp)

        # batched version of: tmp1 = chol_x.T.dot(dz)
        tmp1 = tensor.sum(
            chol_x.dimshuffle(0, 2, 1, 'x') * dz.dimshuffle(0, 'x', 1, 2),
            axis=2).reshape((nb, n, n))
        tmp2 = tril_and_halve_diagonal_batch(tmp1)
        s = conjugate_solve_triangular_batch(chol_x, tmp2)

        if self.lower:
            T = tensor.tril(tensor.ones((n, n)))
        else:
            T = tensor.triu(tensor.ones((n, n)))
        I = tensor.eye(n)
        grad = (s + s.dimshuffle(0, 2, 1)) * T - s * I.dimshuffle('x', 0, 1)
        return [grad]


# Batched version of GpuCublasTriangularSolve. Note: this Op accepts a
# matrix RHS, but that case is intended for use within an L_op and
# works only when b is (batchsize, n, n) and is transposed!
class GpuCublasTriangularSolveBatch(Op):
    """
    CUBLAS GPU Triangular Solve Op - Batched.
    Parameters
    ----------
    lower
        Whether the system is lower-triangular (True) or
        upper-triangular (False).
    trans
        Whether to take the transpose of the input matrix or not.

    """
    __props__ = ('trans', 'lower')

    def __init__(self, lower=True, trans='N'):
        self.trans = trans
        self.lower = lower
        super(GpuCublasTriangularSolveBatch, self).__init__()

    def make_node(self, inp1, inp2):
        if not cublas_available:
            raise RuntimeError('CUBLAS is not available and '
                               'GpuCublasTriangularSolve Op '
                               'can not be constructed.')
        context_name = infer_context_name(inp1, inp2)

        inp1 = as_gpuarray_variable(inp1, context_name)
        inp2 = as_gpuarray_variable(inp2, context_name)

        inp1 = gpu_contiguous(inp1)
        inp2 = gpu_contiguous(inp2)

        assert inp1.ndim == 3
        assert inp2.ndim in [2, 3]
        assert inp1.dtype == inp2.dtype

        return theano.Apply(self, [inp1, inp2],
                            [GpuArrayType(inp1.dtype,
                                          broadcastable=inp2.broadcastable,
                                          context_name=context_name)()])

    def prepare_node(self, node, storage_map, compute_map, impl):
        ctx = node.inputs[0].type.context
        attach_cublas_handle_to_context(ctx)

    def perform(self, node, inputs, outputs):
        ctx = node.inputs[0].type.context

        # Solution set.
        x = outputs[0]

        # Matrix.
        A = inputs[0]

        # Right hand side.
        b = inputs[1]

        assert len(A.shape) == 3          # (nb, n, n)
        assert len(b.shape) in [2, 3]     # (nb, n) or (nb, n, m)

        # Implicitly deal with the difference between C order
        # and Fortran order in A by flipping the trans and lower flags.
        lower = not self.lower
        trans = self.trans
        if trans in ['T', 'C']:
            trans = 'N'
            nb, l, n = A.shape
        elif trans == 'N':
            trans = 'T'
            nb, n, l = A.shape
        else:
            raise ValueError('Invalid value for trans')

        if len(b.shape) == 3:
            nb2, k, m = b.shape
        else:
            nb2, k = b.shape
            m = 1

        if nb != nb2:
            raise ValueError('A and b must be aligned.')
        if l != n:
            raise ValueError('A must be a square matrix')
        if n != k:
            raise ValueError('A and b must be aligned.')

        lda = max(1, n)
        ldb = max(1, k)

        # Solution overwrites right hand side on exit.
        # PMB: the GPU wants F ordering, but the input from Python is C
        # ordered, and we need C ordering to index into the batch.
        # This only works if b is a vector (times batch length);
        # the only time m > 1 is when this Op is called from within an
        # L_op, and there the data must be transposed.
        b = pygpu.array(b, copy=True, order='C')  # PMB changed to C ordered

        A_ptr = A.gpudata
        b_ptr = b.gpudata

        # unit scalar used for multiplication
        alpha = 1.0
        # indicates matrix A is on left of B
        side = 'l'
        # set whether upper or lower part of matrix A stored
        uplo = 'l' if lower else 'u'
        # indicates elements on diagonal of matrix A may not be unity
        diag = 'n'

        if A.dtype == 'float32':
            trsv = cublas.cublasStrsv
            trsm = cublas.cublasStrsm
            wordlen = 4
        elif A.dtype == 'float64':
            trsv = cublas.cublasDtrsv
            trsm = cublas.cublasDtrsm
            wordlen = 8
        else:
            raise ValueError("Unsupported dtype")

        with ctx:
            for ib in range(nb):
                # matrix-vector solve; pointer offsets are in bytes
                trsv(ctx.cublas_handle, uplo, trans, diag, n,
                     A_ptr + ib * n * n * wordlen, lda,
                     b_ptr + ib * n * m * wordlen, 1)

        if m > 1:
            x[0] = b.reshape((nb, n, m))
        else:
            x[0] = b.reshape((nb, n))

    # Note to self - L_op is just the linearized and transposed Op.
    # Trivial case: when the Op is y = A * x, then L_op is x_bar = A.T * y_bar.
    # y_bar = "output_gradients" and has the same shape and form as y;
    # x_bar = "input_gradients" (the output of L_op) and has the same shape
    # and form as x.
    #
    # Note: L_op will only work for m = 1 (b is a batched vector).
    def L_op(self, inputs, outputs, output_gradients):
        # Modified from theano/tensor/slinalg.py
        A, b = inputs
        nb, n, l = A.shape
        nb2, k = b.shape

        # reshape variables with batch as first dimension
        c = outputs[0]  # c is of shape (nb, n)

        # get the dimension of the gradients
        c_bar = output_gradients[0]  # c_bar is of shape (nb, n)

        trans_solve_op = GpuCublasTriangularSolveBatch(not self.lower)

        # non-batched version: b_bar = trans_solve_op(A.T, c_bar)
        A_T = A.dimshuffle(0, 2, 1)
        b_bar = trans_solve_op(A_T, c_bar)  # b_bar has shape (nb, n)

        # non-batched version: A_bar = -tensor.outer(b_bar, c)
        A_bar = -b_bar.dimshuffle(0, 1, 'x') * c.dimshuffle(0, 'x', 1)
        A_bar = A_bar.reshape((nb, n, n))

        # non-batched operation:
        #     A_bar = tensor.tril(A_bar) if self.lower else tensor.triu(A_bar)
        # batched operation:
        if self.lower:
            I = tensor.tril(tensor.ones((n, n)))
        else:
            I = tensor.triu(tensor.ones((n, n)))
        A_bar = A_bar * I.dimshuffle('x', 0, 1)

        return [A_bar, b_bar]


def gpu_solve_upper_triangular_batch(A, b, trans='N'):
    return GpuCublasTriangularSolveBatch(False, trans)(A, b)


class GpuCusolverSolveBatch(Op):
    """
    CUSOLVER GPU solver Op - Batched.

    Parameters
    ----------
    trans
        Whether to take the transpose of the input matrix or not.
""" __props__ = ('A_structure', 'trans', 'inplace') def __init__(self, A_structure='general', trans='N', inplace=False): self.trans = trans self.inplace = inplace self.A_structure = A_structure if self.inplace: self.destroy_map = {0: [0]} assert A_structure in MATRIX_STRUCTURES_SOLVE super(GpuCusolverSolveBatch, self).__init__() def make_node(self, inp1, inp2): if not cusolver_available: raise RuntimeError('CUSOLVER is not available and ' 'GpuCusolverSolve Op can not be constructed.') if skcuda.__version__ <= '0.5.1': warnings.warn('The GpuSolve op requires scikit-cuda > 0.5.1 to work with CUDA 8') context_name = infer_context_name(inp1, inp2) inp1 = as_gpuarray_variable(inp1, context_name) inp2 = as_gpuarray_variable(inp2, context_name) inp1 = gpu_contiguous(inp1) inp2 = gpu_contiguous(inp2) assert inp1.ndim == 3 assert inp2.ndim == 3 assert inp1.dtype == inp2.dtype return theano.Apply( self, [inp1, inp2], [GpuArrayType(inp1.dtype, broadcastable=inp1.broadcastable, context_name=context_name)()]) def prepare_node(self, node, storage_map, compute_map, impl): ctx = node.inputs[0].type.context attach_cusolver_handle_to_context(ctx) def check_dev_info(self, dev_info): val = np.asarray(dev_info)[0] if val > 0: raise LinAlgError('A is singular') def perform(self, node, inputs, outputs): context = inputs[0][0].context # Size of the matrices to invert. z = outputs[0] # Matrix. A = inputs[0] # Solution vectors. 
        b = inputs[1]

        assert len(A.shape) == 3
        assert len(b.shape) == 3

        if self.trans in ['T', 'C']:
            trans = 1
            nb, l, n = A.shape
            nb2, k, m = b.shape
        elif self.trans == 'N':
            trans = 0
            nb, n, l = A.shape
            nb2, k, m = b.shape
        else:
            raise ValueError('Invalid value for trans')

        if nb != nb2:
            raise ValueError('A and b must have the same batch length')
        if l != n:
            raise ValueError('A must be a square matrix')
        if n != k:
            raise ValueError('A and b must be aligned.')

        lda = max(1, n)
        ldb = max(1, k)

        # We copy A and b as cusolver operates inplace.
        # PMB: changed to C ordering, needed to index by batch.
        b = pygpu.array(b, copy=True, order='C')
        if not self.inplace:
            A = pygpu.array(A, copy=True)
        trans = 1 - trans

        A_ptr = A.gpudata
        b_ptr = b.gpudata

        # cusolver expects a F ordered matrix, but A is not explicitly
        # converted between C and F order; instead we switch the
        # "transpose" flag.
        if A.flags['C_CONTIGUOUS']:
            trans = 1 - trans

        if A.dtype == 'float32':
            potrf_bufferSize = cusolver.cusolverDnSpotrf_bufferSize
            potrf = cusolver.cusolverDnSpotrf
            potrs = cusolverDnSpotrs
            getrf_bufferSize = cusolver.cusolverDnSgetrf_bufferSize
            getrf = cusolver.cusolverDnSgetrf
            getrs = cusolver.cusolverDnSgetrs
            wordlen = 4
        elif A.dtype == 'float64':
            potrf_bufferSize = cusolver.cusolverDnDpotrf_bufferSize
            potrf = cusolver.cusolverDnDpotrf
            potrs = cusolverDnDpotrs
            getrf_bufferSize = cusolver.cusolverDnDgetrf_bufferSize
            getrf = cusolver.cusolverDnDgetrf
            getrs = cusolver.cusolverDnDgetrs
            wordlen = 8
        else:
            raise ValueError("Unsupported dtype")

        if self.A_structure == 'symmetric':
            with context:
                workspace_size = potrf_bufferSize(
                    context.cusolver_handle, 0, n, A_ptr, lda)

            workspace = pygpu.zeros(workspace_size, dtype=A.dtype,
                                    context=context)

            dev_info = pygpu.zeros((1,), dtype='int32', context=context)

            workspace_ptr = workspace.gpudata
            dev_info_ptr = dev_info.gpudata

            with context:
                potrf(
                    context.cusolver_handle, 0, n, A_ptr, lda,
                    workspace_ptr, workspace_size, dev_info_ptr)

                self.check_dev_info(dev_info)

                potrs(
                    context.cusolver_handle, 0, n, m, A_ptr, lda,
                    b_ptr, ldb, dev_info_ptr)
        else:
            # general case for A
            with context:
                workspace_size = getrf_bufferSize(
                    context.cusolver_handle, n, n, A_ptr, lda)

            workspace = pygpu.zeros(workspace_size, dtype=A.dtype,
                                    context=context)

            pivots = pygpu.zeros(n, dtype='int32', context=context)

            dev_info = pygpu.zeros((1,), dtype='int32', context=context)

            workspace_ptr = workspace.gpudata
            pivots_ptr = pivots.gpudata
            dev_info_ptr = dev_info.gpudata

            with context:
                for ib in range(nb):
                    # device pointer offsets are in bytes, hence the
                    # wordlen factor (matches the batched trsv/potrf code)
                    getrf(
                        context.cusolver_handle, n, n,
                        A_ptr + ib * n * n * wordlen, lda,
                        workspace_ptr, pivots_ptr, dev_info_ptr)
                    self.check_dev_info(dev_info)

                    getrs(
                        context.cusolver_handle, trans, n, m,
                        A_ptr + ib * n * n * wordlen, lda, pivots_ptr,
                        b_ptr + ib * n * m * wordlen, ldb, dev_info_ptr)

        z[0] = b

    def L_op(self, inputs, outputs, output_gradients):
        # Modified from theano/tensor/slinalg.py
        A, b = inputs
        nb, n, l = A.shape
        nb2, n, m = b.shape

        c = outputs[0]
        # c is (nb, n, m), but the grad should only be used when m = 1
        c = c.reshape((nb, n))

        c_bar = output_gradients[0]
        cdim = theano.tensor.prod(c_bar.shape)
        ndrv = cdim // nb // n
        c_bar = c_bar.reshape((nb, n, ndrv))

        # FIXME: triangular structure would use GpuCublasTriangularSolve?
        # No need to handle A_structure like slinalg.py?
        trans_solve_op = GpuCusolverSolveBatch('general')
        A_T = A.dimshuffle(0, 2, 1)
        b_bar = trans_solve_op(A_T, c_bar)
        b_bar = b_bar.reshape((nb, n, ndrv))

        # non-batched version: A_bar = -tensor.outer(b_bar, c)
        A_bar = -b_bar.dimshuffle(0, 1, 2, 'x') * c.dimshuffle(0, 'x', 'x', 1)
        A_bar = A_bar.reshape((nb * n, n, ndrv))
        b_bar = b_bar.reshape(c_bar.shape)
        return [A_bar, b_bar]
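A quick way to convince yourself of the L_op formulas used in the triangular-solve Ops above (b_bar = A^{-T} c_bar and A_bar = -tril(outer(b_bar, c)) for a lower-triangular system c = A^{-1} b) is a finite-difference check in plain NumPy. This is only a sanity sketch of the non-batched math, not part of any Op, and it uses dense `np.linalg.solve` rather than a triangular solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# well-conditioned lower-triangular A
A = np.tril(rng.normal(size=(n, n))) + n * np.eye(n)
b = rng.normal(size=n)
c_bar = rng.normal(size=n)  # plays the role of output_gradients[0]

c = np.linalg.solve(A, b)              # forward: c = A^{-1} b
b_bar = np.linalg.solve(A.T, c_bar)    # L_op: b_bar = A^{-T} c_bar
A_bar = -np.tril(np.outer(b_bar, c))   # L_op: A_bar on the lower triangle

eps = 1e-6

# finite-difference check of b_bar for f(b) = c_bar . A^{-1} b
def f(bb):
    return c_bar @ np.linalg.solve(A, bb)

fd_b = np.array([(f(b + eps * e) - f(b - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

# finite-difference check of A_bar over the lower triangle of A
def g(AA):
    return c_bar @ np.linalg.solve(AA, b)

fd_A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1):
        E = np.zeros((n, n))
        E[i, j] = eps
        fd_A[i, j] = (g(A + E) - g(A - E)) / (2 * eps)

assert np.allclose(fd_b, b_bar, atol=1e-5)
assert np.allclose(fd_A, A_bar, atol=1e-5)
print("triangular-solve L_op formulas match finite differences")
```

The same pattern (perturb one input entry, compare against the analytic bar quantities) is what `theano.tests.unittest_tools.verify_grad` does internally, so it is a cheap way to debug a hand-written batched L_op before wiring it into the graph.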