In case anyone needs this in the future, here is what I managed to put together, and please let me know if I am doing something reckless or wrong. It is slightly faster than numpy.asfortranarray and it doesn't show any cache miss symptoms but can't say I did a thorough bencmark testing. I chose the blocksize 16 completely based on the current locations of the planets. 32 or 64 can also work but NumPy/SciPy is used on all kinds of esoteric places so went for a small number.
@cython.cdivision(True) @cython.wraparound(False) @cython.boundscheck(False) @cython.initializedcheck(False) cdef void swap_c_and_f_layout(double *a, double *b, int r, int c, int n) nogil: """Recursive matrix transposition from a to b, both n**2-long flat arrays""" cdef int i, j, ith_row, r2, c2 cdef double *bb=b cdef double *aa=a if c < 16: for j in range(c): ith_row = 0 for i in range(r): bb[ith_row] = aa[i] ith_row += n aa += n bb += 1 else: # If tall if (r > c): r2 = r//2 swap_c_and_f_layout(a, b, r2, c, n) swap_c_and_f_layout(a + r2, b+(r2)*n, r-r2, c, n) else: # Nope c2 = c//2 swap_c_and_f_layout(a, b, r, c2, n); swap_c_and_f_layout(a+(c2)*n, b+c2, r, c-c2, n) For the desperate souls reading this in the future; I feel your pain :) On Thu, Nov 11, 2021 at 3:36 AM Ilhan Polat <ilhanpo...@gmail.com> wrote: > Ah I see. Thank you Sebastian, I was hoping to avoid all that blocking > (since HW dependency leaves some performance at many tables) or recursive > zooming stuff with some off-the-shelf tool but apparently I'm walking in > the dusty corners again collecting spider webs :) As you said, there are > quite a lot of low hanging fruits we might collect regarding such data > manipulations which will boost basically everything since these ops are > ubiquitous. > > In case any one is wondering the context; this is for the > scipy.linalg.expm overhaul mainly kept updated at > https://github.com/scipy/scipy/issues/12838 > > > > On Thu, Nov 11, 2021 at 2:40 AM Sebastian Berg <sebast...@sipsolutions.net> > wrote: > >> On Thu, 2021-11-11 at 01:04 +0100, Ilhan Polat wrote: >> > Hmm not sure I understand the question but this is what I mean by naive >> > looping, suppose I allocate a scratch register work3, then >> > >> > for i in range(n): for j in range(n): work3[j*n+i] = work2[i*n+j] >> > >> >> NumPy does not end up doing anything special. Special would be to use >> a blocked iteration and NumPy doesn't have it unfortunately. >> The only thing it does is use pointers to cut some overheads, something >> (very rough) like: >> >> ptr1 = arr1.data >> ptr2_col = arr2.data >> >> strides2_col = arr.strides[0] >> strides2_row = arr2.strides[1] >> >> for i in range(n): >> ptr2 = ptr2_col >> for j in range(n): >> *ptr2 = *ptr1 >> ptr1++ >> ptr2 += strides2_row >> >> ptr2_col += strides2_col >> >> And if you write that in cython, you are likely faster since you can >> cut quite a few corners (all is aligned, contiguous, etc.). >> (with potentially, loop unrolling/compiler optimization fluctuations, >> numpy probably tells GCC to unroll and optimize the innermost loop >> there) >> >> I would not be surprised if you can find a lightweight fast copy- >> transpose out there, or if some tools like MKL/Cuda just include it. It >> is too bad NumPy is missing it. >> >> Cheers, >> >> Sebastian >> >> >> > >> > >> > This basically doing the row to column based indexing and obviously we >> > create a lot of cache misses since work3 entries are accessed in the >> > shuffled fashion. The idea of all this Cython attempt is to avoid such >> > access hence if the original some_C_layout_func takes 10 units of time, >> > 6 >> > of it is spent on this loop when the data doesn't fit the cache. When I >> > discard the correctness of the function and comment out this loop and >> > then >> > remeasure the original func spends roughly 3 units of time. However >> > take >> > any random array in C order in NumPy using regular Python and use >> > np.asfortranarray() it spends roughly about 0.1 units of time. So >> > apparently it is possible to do this somehow at the low level in a >> > performant way. That's what I would like to understand or clear out my >> > misunderstanding. >> > >> > >> > >> > >> > >> > On Thu, Nov 11, 2021 at 12:56 AM Andras Deak <deak.and...@gmail.com> >> > wrote: >> > >> > > On Thursday, November 11, 2021, Ilhan Polat <ilhanpo...@gmail.com> >> > > wrote: >> > > >> > > > I've asked this in Cython mailing list but probably I should also >> > > > get >> > > > some feedback here too. >> > > > >> > > > I have the following function defined in Cython and using flat >> > > > memory >> > > > pointers to hold n by n array data. >> > > > >> > > > >> > > > cdef some_C_layout_func(double[:, :, ::1] Am) nogil: # ... cdef >> > > > double * >> > > > work1 = <double*>malloc(n*n*sizeof(double)) cdef double *work2 = >> > > > <double >> > > > *>malloc(n*n*sizeof(double)) # ... # Lots of C-layout operations >> > > > here # >> > > > ... dgetrs(<char*>'T', &n, &n, &work1[0], &n, &ipiv[0], &work2[0], >> > > > &n, & >> > > > info ) dcopy(&n2, &work2[0], &int1, &Am[0, 0, 0], &int1) free(...) >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > Here, I have done everything in C layout with work1 and work2 but I >> > > > have >> > > > to convert work2 into Fortran layout to be able to solve AX = B. A >> > > > can be >> > > > transposed in Lapack internally via the flag 'T' so the only >> > > > obstacle I >> > > > have now is to shuffle work2 which holds B transpose in the eyes of >> > > > Fortran >> > > > since it is still in C layout. >> > > > >> > > > If I go naively and make loops to get one layout to the other that >> > > > actually spoils all the speed benefits from this Cythonization due >> > > > to cache >> > > > misses. In fact 60% of the time is spent in that naive loop across >> > > > the >> > > > whole function. >> > > > >> > > > >> > > Sorry if this is a dumb question, but is this true whether or not you >> > > loop >> > > over contiguous blocks of the input vs the output array? Or is the >> > > faster >> > > of the two options still slower than the linsolve? >> > > >> > > AndrĂ¡s >> > > >> > > >> > > > >> > > > Same goes for the copy_fortran() of memoryviews. >> > > > >> > > > I have measured the regular NumPy np.asfortranarray() and the >> > > > performance is quite good enough compared to the actual linear >> > > > solve. Hence >> > > > whatever it is doing underneath I would like to reach out and do >> > > > the same >> > > > possibly via the C-API. But my C knowledge basically failed me >> > > > around this >> > > > line >> > > > >> https://github.com/numpy/numpy/blob/8dbd507fb6c854b362c26a0dd056cd04c9c10f25/numpy/core/src/multiarray/multiarraymodule.c#L1817 >> > > > >> > > > I have found the SO post from >> > > > >> https://stackoverflow.com/questions/45143381/making-a-memoryview-c-contiguous-fortran-contiguous >> > > > but I am not sure if that is the canonical way to do it in newer >> > > > Python >> > > > versions. >> > > > >> > > > Can anyone show me how to go about it without interacting with >> > > > Python >> > > > objects? >> > > > >> > > > Best, >> > > > ilhan >> > > > >> > > _______________________________________________ >> > > NumPy-Discussion mailing list -- numpy-discussion@python.org >> > > To unsubscribe send an email to numpy-discussion-le...@python.org >> > > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ >> > > Member address: ilhanpo...@gmail.com >> > > >> > _______________________________________________ >> > NumPy-Discussion mailing list -- numpy-discussion@python.org >> > To unsubscribe send an email to numpy-discussion-le...@python.org >> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ >> > Member address: sebast...@sipsolutions.net >> >> _______________________________________________ >> NumPy-Discussion mailing list -- numpy-discussion@python.org >> To unsubscribe send an email to numpy-discussion-le...@python.org >> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ >> Member address: ilhanpo...@gmail.com >> >
_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com