The current kernel framework divides inputs (e.g. arrays, chunked arrays) into
batches and feeds them to the kernel code.
Would it make sense to pass the input arguments directly to the kernel?
I'm writing a quantile kernel, which needs to allocate a buffer to record all
inputs and find the nth element at the end. For a chunked array, the input is
received chunk by chunk, so the kernel doesn't know the total buffer size and
cannot allocate it all at once. It would be convenient if the kernel could see
the raw chunked array input.
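
To make the difference concrete, here is a minimal Python sketch. It is not
the framework's real code: the names QuantileState, consume, finalize and the
two helper functions are hypothetical, chunks are plain lists, a full sort
stands in for an nth-element selection, and the "lower" quantile definition is
assumed. The first path grows its buffer as each chunk arrives; the second
sees all chunks up front and can size the buffer once.

```python
import math


class QuantileState:
    """Hypothetical kernel state that only ever sees one chunk at a time."""

    def __init__(self):
        self.buffer = []

    def consume(self, chunk):
        # Total input length is unknown here, so the buffer must grow
        # (and possibly reallocate) on every call.
        self.buffer.extend(chunk)

    def finalize(self, q):
        if not self.buffer:
            return None
        data = sorted(self.buffer)  # stand-in for an nth-element selection
        return data[math.floor(q * (len(data) - 1))]


def quantile_chunk_by_chunk(chunks, q):
    """Batch-wise path: the kernel is fed chunks one after another."""
    state = QuantileState()
    for chunk in chunks:
        state.consume(chunk)
    return state.finalize(q)


def quantile_whole_input(chunks, q):
    """Direct path: the kernel sees every chunk, so it can allocate once."""
    total = sum(len(chunk) for chunk in chunks)
    if total == 0:
        return None
    buffer = [None] * total  # single allocation of the exact size
    offset = 0
    for chunk in chunks:
        buffer[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    buffer.sort()  # stand-in for an nth-element selection
    return buffer[math.floor(q * (total - 1))]
```

Both functions return the same value; the only difference is whether the
buffer size is known before any data is copied.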
Or is there a better way to achieve this? Thanks.