Hello everyone. Andreas - do not worry, I know what academic life looks like ;-) I just wanted to be sure that my patch did not slip "under the radar".
On Friday, 2010-10-22, at 09:27 -0400, Frédéric Bastien wrote:
> I hit a similar problem when porting Theano to a newer version of CUDA (the
> one that supports Fermi cards). The problem is not the card, but the
> compiler/card combination. What happens is that newer nvcc versions do
> more code optimization, and this breaks an old assumption that many
> people made. The new optimization relies on the assumption that
> operations within a warp may be reordered if there is no
> synchronization barrier. I think this was not a problem on pre-Fermi
> cards, as the warp was executed sequentially, but that is not always
> true on Fermi cards (I'm not certain about pre-Fermi, but I'm
> sure about Fermi). The volatile keyword tells the compiler that
> the value can change between thread operations, so it does not
> apply the new optimization.
>
> Did I explain clearly enough what happened?

IMO it is more or less correct. I will try to explain the problem, and my road to the solution, in more detail.

<put on my Teaching Assistant hat>

As some of you noted, a few GPUArray tests failed on Fermi. In the beginning I assumed (and wrote about it on this list on 2010-09-27 21:06) that this was caused by the changes in the IEEE-754 implementation made in Fermi devices. Then jmcarval (sorry if I forgot your name) sent a test case (2010-09-28 9:29) which showed that there is something more going on in the Fermi case. I wanted to make the Fermi run the older code (email from 2010-09-28 23:56), but that was not possible: an sm_2 device cannot run sm_11 code (further explanation in the email from 2010-10-01 00:00).

I started analysing the code generated by nvcc and noticed that code for sm_2 devices is optimised more aggressively than code for sm_11 (Ion). For example, it keeps more variables in registers instead of reading them from memory (email from 2010-10-01 0:40). But I found no differences in the code that does the reduction, only in the preparatory code. Then I started analysing the reduction kernels.
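For readers who have not seen it, the reduction pattern under discussion looks roughly like the sketch below. This is not PyCUDA's actual generated code; the kernel and variable names are invented for illustration, but the shape (a barrier-synchronised tree phase, then an unsynchronised final warp) is the classic one:

```cuda
// Sketch of a classic shared-memory tree reduction (illustrative only,
// not PyCUDA's generated code). Launched with blockDim.x = 64..512.
__global__ void reduce_sum(const float *in, float *out)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree phase: more than one warp is active, so a barrier
    // is required between steps.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Final phase: only the first warp is active. The classic code
    // drops __syncthreads() here, assuming the warp runs in lock-step
    // and that shared-memory writes are immediately visible.
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
```

It is exactly that unsynchronised tail that the aggressive sm_2 optimisation breaks, as described below.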
I noticed that when there is a large number of threads the code calls __syncthreads(), and when the number of threads drops below 32 it stops synchronising them. I added synchronisation to that part of the code (first patch, synchronize.diff). I must admit that I had totally forgotten about warps - I was thinking in terms of blocks and grids and was baffled by the groups of 32 threads. Then I went back to the documentation, noticed that there are documents about Fermi, and found this "volatile" in chapter 1.2.2; voilà, patch volatile.diff from the same email.

Why it works (at least according to my knowledge; I do not know the intimate hardware details): in the Tesla architecture each SM had 8 cores, so one warp was executed in 4 steps. There was also a limited number of registers, so the compiler generated code that kept everything in memory, leaving registers free to be used. Fermi (compute capability 2.x) offers a cache and more registers (32k registers vs. 8k in 1.1 devices). This way the compiler need not worry about registers: it optimises the code, keeps values in registers, and flushes cache and registers to memory only at a thread barrier or on demand. This leads to faster code, but also to differences between what is in a register and what is in memory. This matters on Fermi, which has 32 cores per SM and can execute half-warps in a different order, disregarding inter-thread dependencies. The situation gets even more interesting on 2.1 devices, which have 48 cores per SM.

The keyword "volatile" tells the compiler that the value held in a variable may be changed by some mechanism the program does not control. In the "old times" it was used when programming hardware devices: hardware registers were declared as "volatile" variables, as their values were influenced by the external device. Basically, "volatile" prevents the compiler from applying the optimisations described two paragraphs above. Each time the program reads the variable, it reads it from memory, not from a cache or register.
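A sketch of how such a volatile fix typically looks (the actual volatile.diff may differ in detail; names here are illustrative): the final-warp tail accesses shared memory through a volatile pointer, so every read and write goes through memory rather than a register copy.

```cuda
// Final reduction phase, fixed: accessing sdata through a volatile
// pointer forces each += to re-read and re-write shared memory,
// restoring correctness without __syncthreads() in the last warp.
if (tid < 32) {
    volatile float *smem = sdata;  // sdata is the block's shared array
    smem[tid] += smem[tid + 32];
    smem[tid] += smem[tid + 16];
    smem[tid] += smem[tid + 8];
    smem[tid] += smem[tid + 4];
    smem[tid] += smem[tid + 2];
    smem[tid] += smem[tid + 1];
}
```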
Likewise, each time the program writes the variable, it writes it to memory, not to a cache or register. This way any other thread will read the proper, recently calculated value, not a stale one.

Maybe it is time to start teaching students again about "volatile", "register" and the other storage classes in Introduction to Programming? If you want to learn more about cache coherency and so on, watch Mary Jane Irwin's lecture about caches in multicore CPUs, which she gave at the ACM meeting where she received the Athena Award: http://awards.acm.org/athena/2010

BTW I have a GTX 460 - Fermi with compute capability 2.1. Not to bore you with the details: I now have a new GPU, my brother has my previous card instead of his old broken one, and everyone's computer is working. It was rather convenient timing, but I am not complaining - I got a new GPU!

I have also checked both patches (volatile.diff and synchronize.diff), and both of them work. I have checked that in 0.92.2 the generated PTX uses ld.volatile.shared.x64 instead of ld.shared.x64 to fetch values from memory in the final reduction warp. It also uses more registers (48 instead of 43) in the entire kernel.

I hope this explains what is going on, what changed in Fermi, and how it was corrected in PyCUDA.

Regards
--
Tomasz Rybak <bogom...@post.pl> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda