On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> You can only override at runtime those dimensions that you said you'd override
> at runtime when you compiled your program.

Ah, I see. That's not obvious to me, so perhaps added documentation can be
expanded to explain that? (I now see that the plugin silently drops
user-provided dimensions where a value recorded at compile time is present; not
sure if that'd be worth a runtime diagnostic, could be very noisy)

> > I don't see why you say that because cuDeviceGetAttribute provides
> > CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
> > CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case)
> > and cuFuncGetAttribute that allows to get a per-function thread limit.
> > There's a patch on gomp-nvptx branch that adds querying some of those to
> > the plugin.
>
> thanks. There doesn't appear to be one for number of physical CTAs though,
> right?

Sorry, I don't understand the question: CTA is a logical entity. One could
derive limit of possible concurrent CTAs from number of SMs
(CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT) multiplied by how many CTAs fit on
one multiprocessor. The latter figure can be taken as a rough worst-case value,
or semi-intelligent per-kernel estimate based on register limits (there's code
on gomp-nvptx branch that does this), or one can use the cuOcc* API to ask the
driver for a precise per-kernel figure.

Alexander
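
For illustration only, the register-based estimate described above might look
roughly like the sketch below. It is not the code from the gomp-nvptx branch.
The struct and function names are hypothetical, and the per-SM CTA cap is an
assumed input rather than something queried from the driver. On a real device
the SM count and register file size would presumably come from
cuDeviceGetAttribute, and the per-thread register count from cuFuncGetAttribute
with CU_FUNC_ATTRIBUTE_NUM_REGS. The sketch ignores register allocation
granularity and shared memory, so it is an approximation at best.

```c
/* Hypothetical container for device limits; not part of any real API. */
struct sm_limits {
    int sm_count;        /* e.g. CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT */
    int regs_per_sm;     /* e.g. CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR */
    int max_ctas_per_sm; /* assumed architectural cap, supplied by the caller */
};

/* Estimate how many CTAs of one kernel fit on a single multiprocessor,
   considering only the register file and the assumed per-SM CTA cap. */
static int ctas_per_sm(const struct sm_limits *dev,
                       int regs_per_thread, int threads_per_cta)
{
    if (regs_per_thread <= 0 || threads_per_cta <= 0)
        return 0; /* no meaningful estimate for invalid inputs */

    long regs_per_cta = (long)regs_per_thread * threads_per_cta;
    long by_regs = dev->regs_per_sm / regs_per_cta;

    return by_regs < dev->max_ctas_per_sm ? (int)by_regs
                                          : dev->max_ctas_per_sm;
}

/* Upper bound on concurrently resident CTAs: SM count times CTAs per SM. */
static int max_concurrent_ctas(const struct sm_limits *dev,
                               int regs_per_thread, int threads_per_cta)
{
    return dev->sm_count * ctas_per_sm(dev, regs_per_thread, threads_per_cta);
}
```

With made-up numbers (16 SMs, 65536 registers per SM, a cap of 16 CTAs per SM),
a kernel using 32 registers per thread and 256 threads per CTA would be
estimated at 8 CTAs per SM, i.e. 128 concurrent CTAs. For a precise per-kernel
figure the driver's occupancy API (e.g.
cuOccupancyMaxActiveBlocksPerMultiprocessor) is the better source.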