Re: [PyOpenCL] invalid work group size, on Mac OSX Lion

Vincent Favre-Nicolin Sat, 29 Oct 2011 05:29:41 -0700

    Hi,

  I tried this (current git) on my new Mac mini (running OS X Lion and Xcode 
4.2), and test_array now has 2 failures (one on the i5, one on the ATI 6630M):
 
test_array.py ........................F............................F......


cl_math and wrapper passed (one skipped for cl_math).

  Below: dump_properties, and result of test_array.

  I'm also trying my own code and getting "invalid work group size" errors on 
the i5 (no problem on the ATI card), but that may be my mistake: I still haven't 
fully understood work sizes after moving from CUDA to OpenCL...

  In the examples, transpose.py fails (on the i5) with an invalid work group 
size when starting the "naive" test, maybe because of the fixed block size of 16?
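
  A minimal sketch of one possible explanation, not a confirmed diagnosis: the 
dump below reports MAX_WORK_ITEM_SIZES: [1024, 1, 1] for the i5 and 
[1024, 1024, 1024] for the ATI card, so any local size larger than 1 in the 
second or third dimension would be rejected on the i5. The helper 
check_local_size is hypothetical (it is not part of PyOpenCL), and the 16x16 
local size is an assumption about how transpose.py uses its block size of 16.

```python
def check_local_size(local_size, max_work_group_size, max_work_item_sizes):
    """Return a list of reasons a local size would be rejected (empty if OK).

    Hypothetical helper, not a PyOpenCL function. It only checks the two
    device limits printed by dump-properties.py; it does not check that the
    global size is a multiple of the local size.
    """
    problems = []
    total = 1
    for dim, (size, limit) in enumerate(zip(local_size, max_work_item_sizes)):
        total *= size
        # Per-dimension limit: MAX_WORK_ITEM_SIZES[dim]
        if size > limit:
            problems.append(
                "dimension %d: %d > MAX_WORK_ITEM_SIZES[%d] = %d"
                % (dim, size, dim, limit))
    # Limit on the total number of work-items per group: MAX_WORK_GROUP_SIZE
    if total > max_work_group_size:
        problems.append(
            "total %d > MAX_WORK_GROUP_SIZE = %d" % (total, max_work_group_size))
    return problems


# Limits copied from the dump_properties output in this message.
I5_LIMITS = (1024, [1024, 1, 1])
ATI_LIMITS = (1024, [1024, 1024, 1024])

# Assumed 16x16 work-group (block size 16 in both dimensions).
print(check_local_size((16, 16), *I5_LIMITS))
print(check_local_size((16, 16), *ATI_LIMITS))
```

  With these limits the 16x16 case reports a problem in dimension 1 on the i5 
and none on the ATI card, which would match the behaviour described above. A 
one-dimensional local size such as (256, 1) passes both checks.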
 

    Vincent


####################################################################

Mac-mini-de-Vincent:test vincent$ python ../examples/dump-properties.py 
===========================================================================
<pyopencl.Platform 'Apple' at 0x7fff0000>
===========================================================================
EXTENSIONS: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions 
cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event
NAME: Apple
PROFILE: FULL_PROFILE
VENDOR: Apple
VERSION: OpenCL 1.1 (Jul 25 2011 15:56:07)
Oct 29 13:46:55 Mac-mini-de-Vincent.local Python[144] <Error>: kCGErrorFailure: 
Set a breakpoint @ CGErrorBreakpoint() to catch errors as they are logged.
---------------------------------------------------------------------------
<pyopencl.Device 'Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz' on 'Apple' at 
0xffffffff>
---------------------------------------------------------------------------
ADDRESS_BITS: 64
AVAILABLE: 1
COMPILER_AVAILABLE: 1
DOUBLE_FP_CONFIG: 63
DRIVER_VERSION: 1.1
ENDIAN_LITTLE: 1
ERROR_CORRECTION_SUPPORT: 0
EXECUTION_CAPABILITIES: 3
EXTENSIONS: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions 
cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event 
cl_khr_fp64 cl_khr_global_int32_base_atomics 
cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics 
cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store 
cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes 
cl_APPLE_fp64_basic_ops cl_APPLE_fixed_alpha_channel_orders 
cl_APPLE_biased_fixed_point_image_formats
GLOBAL_MEM_CACHELINE_SIZE: 3145728
GLOBAL_MEM_CACHE_SIZE: 64
GLOBAL_MEM_CACHE_TYPE: 2
GLOBAL_MEM_SIZE: 4294967296
IMAGE2D_MAX_HEIGHT: 8192
IMAGE2D_MAX_WIDTH: 8192
IMAGE3D_MAX_DEPTH: 2048
IMAGE3D_MAX_HEIGHT: 2048
IMAGE3D_MAX_WIDTH: 2048
IMAGE_SUPPORT: 1
LOCAL_MEM_SIZE: 32768
LOCAL_MEM_TYPE: 2
MAX_CLOCK_FREQUENCY: 2500
MAX_COMPUTE_UNITS: 4
MAX_CONSTANT_ARGS: 8
MAX_CONSTANT_BUFFER_SIZE: 65536
MAX_MEM_ALLOC_SIZE: 1073741824
MAX_PARAMETER_SIZE: 4096
MAX_READ_IMAGE_ARGS: 128
MAX_SAMPLERS: 16
MAX_WORK_GROUP_SIZE: 1024
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_ITEM_SIZES: [1024, 1, 1]
MAX_WRITE_IMAGE_ARGS: 8
MEM_BASE_ADDR_ALIGN: 1024
MIN_DATA_TYPE_ALIGN_SIZE: 128
NAME: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
PLATFORM: <pyopencl.Platform 'Apple' at 0x7fff0000>
PREFERRED_VECTOR_WIDTH_CHAR: 16
PREFERRED_VECTOR_WIDTH_DOUBLE: 2
PREFERRED_VECTOR_WIDTH_FLOAT: 4
PREFERRED_VECTOR_WIDTH_INT: 4
PREFERRED_VECTOR_WIDTH_LONG: 2
PREFERRED_VECTOR_WIDTH_SHORT: 8
PROFILE: FULL_PROFILE
PROFILING_TIMER_RESOLUTION: 1
QUEUE_PROPERTIES: 2
SINGLE_FP_CONFIG: 63
TYPE: 2
VENDOR: Intel
VENDOR_ID: 4294967295
VERSION: OpenCL 1.1 
IMAGE2D READ_ONLY FORMATS: RGBA-F, <unknown channel order 0x10bc>-F, <unknown 
channel order 0x10bb>-F, <unknown channel order 0x10ba>-F, INTENSITY-HALF_F, 
RGBA-HALF_F, <unknown channel order 0x10bc>-HALF_F, RGBA-<unknown channel data 
type 268435464>, RGBA-S16, RGBA-S32, RGBA-S8, RGBA-SN16, RGBA-SN8, RGBA-U16, 
RGBA-U32, RGBA-U8, RGBA-UN16, <unknown channel order 0x10bc>-UN16, <unknown 
channel order 0x10000006>-UN8, A-UN8, <unknown channel order 0x10b7>-UN8, 
<unknown channel order 0x10000007>-UN8, BGRA-UN8, INTENSITY-UN8, RGBA-UN8, 
<unknown channel order 0x10bc>-UN8, <unknown channel order 0x10bb>-UN8

IMAGE3D READ_ONLY FORMATS: RGBA-F, <unknown channel order 0x10bc>-F, <unknown 
channel order 0x10bb>-F, <unknown channel order 0x10ba>-F, INTENSITY-HALF_F, 
RGBA-HALF_F, <unknown channel order 0x10bc>-HALF_F, RGBA-<unknown channel data 
type 268435464>, RGBA-S16, RGBA-S32, RGBA-S8, RGBA-SN16, RGBA-SN8, RGBA-U16, 
RGBA-U32, RGBA-U8, RGBA-UN16, <unknown channel order 0x10bc>-UN16, <unknown 
channel order 0x10000006>-UN8, A-UN8, <unknown channel order 0x10b7>-UN8, 
<unknown channel order 0x10000007>-UN8, BGRA-UN8, INTENSITY-UN8, RGBA-UN8, 
<unknown channel order 0x10bc>-UN8, <unknown channel order 0x10bb>-UN8

---------------------------------------------------------------------------
<pyopencl.Device 'ATI Radeon HD 6630M' on 'Apple' at 0x1021b00>
---------------------------------------------------------------------------
ADDRESS_BITS: 32
AVAILABLE: 1
COMPILER_AVAILABLE: 1
DOUBLE_FP_CONFIG: 0
DRIVER_VERSION: 1.0
ENDIAN_LITTLE: 1
ERROR_CORRECTION_SUPPORT: 0
EXECUTION_CAPABILITIES: 1
EXTENSIONS: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions 
cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event 
cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics 
cl_khr_byte_addressable_store cl_khr_3d_image_writes
GLOBAL_MEM_CACHELINE_SIZE: 0
GLOBAL_MEM_CACHE_SIZE: 0
GLOBAL_MEM_CACHE_TYPE: 0
GLOBAL_MEM_SIZE: 134217728
IMAGE2D_MAX_HEIGHT: 8192
IMAGE2D_MAX_WIDTH: 8192
IMAGE3D_MAX_DEPTH: 2048
IMAGE3D_MAX_HEIGHT: 2048
IMAGE3D_MAX_WIDTH: 2048
IMAGE_SUPPORT: 1
LOCAL_MEM_SIZE: 32768
LOCAL_MEM_TYPE: 1
MAX_CLOCK_FREQUENCY: 485
MAX_COMPUTE_UNITS: 6
MAX_CONSTANT_ARGS: 8
MAX_CONSTANT_BUFFER_SIZE: 65536
MAX_MEM_ALLOC_SIZE: 134217728
MAX_PARAMETER_SIZE: 1024
MAX_READ_IMAGE_ARGS: 128
MAX_SAMPLERS: 16
MAX_WORK_GROUP_SIZE: 1024
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_ITEM_SIZES: [1024, 1024, 1024]
MAX_WRITE_IMAGE_ARGS: 8
MEM_BASE_ADDR_ALIGN: 32768
MIN_DATA_TYPE_ALIGN_SIZE: 128
NAME: ATI Radeon HD 6630M
PLATFORM: <pyopencl.Platform 'Apple' at 0x7fff0000>
PREFERRED_VECTOR_WIDTH_CHAR: 16
PREFERRED_VECTOR_WIDTH_DOUBLE: 0
PREFERRED_VECTOR_WIDTH_FLOAT: 4
PREFERRED_VECTOR_WIDTH_INT: 4
PREFERRED_VECTOR_WIDTH_LONG: 2
PREFERRED_VECTOR_WIDTH_SHORT: 8
PROFILE: FULL_PROFILE
PROFILING_TIMER_RESOLUTION: 37
QUEUE_PROPERTIES: 2
SINGLE_FP_CONFIG: 30
TYPE: 4
VENDOR: AMD
VENDOR_ID: 16915200
VERSION: OpenCL 1.1 
IMAGE2D READ_ONLY FORMATS: RGBA-SN8, RGBA-SN16, RGBA-UN8, RGBA-UN16, RGBA-S8, 
RGBA-S16, RGBA-S32, RGBA-U8, RGBA-U16, RGBA-U32, RGBA-HALF_F, RGBA-F, BGRA-UN8, 
<unknown channel order 0x10b7>-UN8, R-SN8, R-UN8, R-UN16, R-S8, R-S16, R-S32, 
R-U8, R-U16, R-U32, R-HALF_F, R-F, A-SN8, A-UN8, A-UN16, A-S8, A-S16, A-S32, 
A-U8, A-U16, A-U32, A-HALF_F, A-F, RG-UN8, RG-UN16, RG-S8, RG-S16, RG-S32, 
RG-U8, RG-U16, RG-U32, RG-HALF_F, RG-F, RA-UN8, RA-UN16, RA-S8, RA-S16, RA-S32, 
RA-U8, RA-U16, RA-U32, RA-HALF_F, RA-F, INTENSITY-UN8, INTENSITY-UN16, 
INTENSITY-HALF_F, INTENSITY-F

IMAGE3D READ_ONLY FORMATS: RGBA-SN8, RGBA-SN16, RGBA-UN8, RGBA-UN16, RGBA-S8, 
RGBA-S16, RGBA-S32, RGBA-U8, RGBA-U16, RGBA-U32, RGBA-HALF_F, RGBA-F, BGRA-UN8, 
<unknown channel order 0x10b7>-UN8, R-SN8, R-UN8, R-UN16, R-S8, R-S16, R-S32, 
R-U8, R-U16, R-U32, R-HALF_F, R-F, A-SN8, A-UN8, A-UN16, A-S8, A-S16, A-S32, 
A-U8, A-U16, A-U32, A-HALF_F, A-F, RG-UN8, RG-UN16, RG-S8, RG-S16, RG-S32, 
RG-U8, RG-U16, RG-U32, RG-HALF_F, RG-F, RA-UN8, RA-UN16, RA-S8, RA-S16, RA-S32, 
RA-U8, RA-U16, RA-U32, RA-HALF_F, RA-F, INTENSITY-UN8, INTENSITY-UN16, 
INTENSITY-HALF_F, INTENSITY-F




####################################################################

Mac-mini-de-Vincent:test vincent$ python test_array.py 
============================= test session starts ==============================
platform darwin -- Python 2.7.2 -- pytest-2.1.3
collected 60 items 

test_array.py ........................F............................F......

=================================== FAILURES ===================================
_ test_random[ctx_factory=<context factory for <pyopencl.Device 'Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz' on 'Apple' at 0xffffffff>>] _

ctx_factory = <pyopencl.tools.ContextFactory instance at 0x1059a4680>

    @pytools.test.mark_test.opencl
    def test_random(ctx_factory):
        context = ctx_factory()
        queue = cl.CommandQueue(context)
    
        from pyopencl.clrandom import RanluxGenerator
    
        if has_double_support(context.devices[0]):
            dtypes = [np.float32, np.float64]
        else:
            dtypes = [np.float32]
    
        gen = RanluxGenerator(queue, 5120)
    
        for ary_size in [300, 301, 302, 303, 10007]:
            for dtype in dtypes:
                ran = cl_array.zeros(queue, ary_size, dtype)
                gen.fill_uniform(ran)
                assert (0 < ran.get()).all()
                assert (ran.get() < 1).all()
    
                gen.synchronize(queue)
    
                ran = cl_array.zeros(queue, ary_size, dtype)
                gen.fill_uniform(ran, a=4, b=7)
>               assert (4 < ran.get()).all()
E               assert <built-in method all of numpy.ndarray object at 0x7f82e4d006d0>()
E                +  where <built-in method all of numpy.ndarray object at 0x7f82e4d006d0> = 4 < array([ 6.19607544,  4.17939472,  4.86412287,  6.28594398,  4.        ,\n      ...4.        ,  4.        ,  4.        ,  4.        ,  4.        ], dtype=float32).all
E                +    where array([ 6.19607544,  4.17939472,  4.86412287,  6.28594398,  4.        ,\n      ...4.        ,  4.        ,  4.        ,  4.        ,  4.        ], dtype=float32) = <bound method Array.get of array([ 6.19607544,  4.17939472,  4.86412287,  6.28....        ,  4.        ,  4.        ,  4.        ,  4.        ], dtype=float32)>()
E                +      where <bound method Array.get of array([ 6.19607544,  4.17939472,  4.86412287,  6.28....        ,  4.        ,  4.        ,  4.        ,  4.        ], dtype=float32)> = array([ 6.19607544,  4.17939472,  4.86412287,  6.28594398,  4.        ,\n      ...4.        ,  4.        ,  4.        ,  4.        ,  4.        ], dtype=float32).get

test_array.py:249: AssertionError
------------------------------- Captured stderr --------------------------------
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pyopencl-2011.1.2-py2.7-macosx-10.7-x86_64.egg/pyopencl/__init__.py:36: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  "to see more.", CompilerWarning)
_ test_scan[ctx_factory=<context factory for <pyopencl.Device 'ATI Radeon HD 6630M' on 'Apple' at 0x1021b00>>] _

ctx_factory = <pyopencl.tools.ContextFactory instance at 0x1059ab248>

    @pytools.test.mark_test.opencl
    def test_scan(ctx_factory):
        context = ctx_factory()
        queue = cl.CommandQueue(context)
    
        from pyopencl.scan import InclusiveScanKernel, ExclusiveScanKernel
    
        dtype = np.int32
        for cls in [InclusiveScanKernel, ExclusiveScanKernel]:
            knl = cls(context, dtype, "a+b", "0")
    
            for n in [
                10, 2 ** 10 - 5, 2 ** 10,
                2 ** 20 - 2 ** 18,
                2 ** 20 - 2 ** 18 + 5,
                2 ** 10 + 5,
                2 ** 20 + 5,
                2 ** 20, 2 ** 24]:
                host_data = np.random.randint(0, 10, n).astype(dtype)
                dev_data = cl_array.to_device(queue, host_data)
    
>               knl(dev_data)

test_array.py:592: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _

self = <pyopencl.compyte.scan.InclusiveScanKernel object at 0x1062e4650>, 
input_ary = array([9, 2, 8, ..., 9, 9, 0], dtype=int32)
output_ary = array([9, 2, 8, ..., 9, 9, 0], dtype=int32), allocator = 
<pyopencl._cl.CLAllocator object at 0x106362788>, queue = 
<pyopencl._cl.CommandQueue object at 0x106362680>

    def __call__(self, input_ary, output_ary=None, allocator=None,
            queue=None):
        allocator = allocator or input_ary.allocator
        queue = queue or input_ary.queue or output_ary.queue
    
        if output_ary is None:
            output_ary = input_ary
    
        if isinstance(output_ary, (str, unicode)) and output_ary == "new":
            output_ary = cl_array.empty_like(input_ary, allocator=allocator)
    
        if input_ary.shape != output_ary.shape:
            raise ValueError("input and output must have the same shape")
    
        if not input_ary.flags.forc:
            raise RuntimeError("ScanKernel cannot "
                    "deal with non-contiguous arrays")
    
        n, = input_ary.shape
    
        if not n:
            return output_ary
    
        unit_size  = self.scan_wg_size * self.scan_wg_seq_batches
        max_groups = 3*max(dev.max_compute_units for dev in self.devices)
    
        from pytools import uniform_interval_splitting
        interval_size, num_groups = uniform_interval_splitting(
                n, unit_size, max_groups);
    
        block_results = allocator(self.dtype.itemsize*num_groups)
        dummy_results = allocator(self.dtype.itemsize)
    
        # first level scan of interval (one interval per block)
        self.scan_intervals_knl(
                queue, (num_groups*self.scan_wg_size,), (self.scan_wg_size,),
                input_ary.data,
                n, interval_size,
                output_ary.data,
>               block_results)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pyopencl-2011.1.2-py2.7-macosx-10.7-x86_64.egg/pyopencl/compyte/scan.py:481:
 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _

self = <pyopencl._cl.Kernel object at 0x105974d08>, queue = 
<pyopencl._cl.CommandQueue object at 0x106362680>, global_size = (2304,)

    def kernel_call(self, queue, global_size, *args, **kwargs):
        global_offset = kwargs.pop("global_offset", None)
        had_local_size = "local_size" in kwargs
        local_size = kwargs.pop("local_size", None)
        g_times_l = kwargs.pop("g_times_l", False)
        wait_for = kwargs.pop("wait_for", None)
    
        if kwargs:
            raise TypeError(
                    "Kernel.__call__ recived unexpected keyword arguments: %s"
                    % ", ".join(kwargs.keys()))
    
        if had_local_size:
            from warnings import warn
            warn("The local_size keyword argument is deprecated and will be "
                    "removed in pyopencl 2012.x. Pass the local "
                    "size as the third positional argument instead.",
                    DeprecationWarning, stacklevel=2)
    
        if isinstance(args[0], (type(None), tuple)) and not had_local_size:
            local_size = args[0]
            args = args[1:]
        elif not had_local_size:
            from warnings import warn
            warn("PyOpenCL Warning: There was an API change "
                    "in Kernel.__call__() in pyopencl 0.92. "
                    "local_size was moved from keyword argument to third "
                    "positional argument in pyopencl 0.92. "
                    "You didn't pass local_size, but you still need to insert "
                    "'None' as a third argument. "
                    "Your present usage is deprecated and will stop "
                    "working in pyopencl 2012.x.",
                    DeprecationWarning, stacklevel=2)
    
        self.set_args(*args)
    
        return enqueue_nd_range_kernel(queue, self, global_size, local_size,
>               global_offset, wait_for, g_times_l=g_times_l)
E   MemoryError: clEnqueueNDRangeKernel failed: mem object allocation failure

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pyopencl-2011.1.2-py2.7-macosx-10.7-x86_64.egg/pyopencl/__init__.py:247: MemoryError
===================== 2 failed, 58 passed in 15.14 seconds =====================
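
For the test_scan MemoryError on the ATI card, a rough footprint estimate may 
help when reading the traceback. This is only arithmetic on numbers already 
shown above (the list of n values in test_scan, the int32 dtype, and the ATI 
GLOBAL_MEM_SIZE and MAX_MEM_ALLOC_SIZE of 134217728 bytes). It does not 
establish which n triggered the failure, and the helper name buffer_bytes is 
made up for this sketch.

```python
ITEMSIZE = 4                    # bytes per np.int32 element
GLOBAL_MEM_SIZE = 134217728     # ATI Radeon HD 6630M, from dump_properties
MAX_MEM_ALLOC_SIZE = 134217728  # same device

# Array lengths used by test_scan, copied from the traceback above.
SIZES = [
    10, 2 ** 10 - 5, 2 ** 10,
    2 ** 20 - 2 ** 18,
    2 ** 20 - 2 ** 18 + 5,
    2 ** 10 + 5,
    2 ** 20 + 5,
    2 ** 20, 2 ** 24]


def buffer_bytes(n, itemsize=ITEMSIZE):
    """Size in bytes of one device buffer holding n elements."""
    return n * itemsize


for n in SIZES:
    nbytes = buffer_bytes(n)
    print("n = %8d: %9d bytes (%5.1f%% of GLOBAL_MEM_SIZE)"
          % (n, nbytes, 100.0 * nbytes / GLOBAL_MEM_SIZE))
```

The largest case, n = 2 ** 24, needs 67108864 bytes for a single buffer, which 
is half of the global memory the device reports, so any buffers from earlier 
iterations that have not yet been released could plausibly push an allocation 
over the limit.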

_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
