Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
Olivier Guilyardi wrote: > To figure whether memory barriers are needed or not, we need to run the test > on > various architectures. > > For this purpose, I've set up a small svn repo with everything to run the test > easily. There's no dependency, you just need pthread, gcc and sh. > > Here's how to run the test: > > svn co http://svn.samalyse.com/misc/rbtest > cd rbtest > make test i don't really understand the finer points of this discussion, but hey it's always great to join a technical thread. my favourite bikeshed color is yellow :-D ok, here comes the result for a single-cpu athlon64 4000+ in 64bit mode. [EMAIL PROTECTED]:/build/rbtest> make test gcc -Wall -I. -I./jack -lpthread -o test-int-array-jack \ test-int-array.c jack/ringbuffer.c test-int-array.c: In function ‘main’: test-int-array.c:101: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘long unsigned int’ gcc -Wall -I. -I./portaudio -lpthread -o test-int-array-portaudio \ test-int-array.c portaudio/ringbuffer.c portaudio/pa_ringbuffer.c test-int-array.c: In function ‘main’: test-int-array.c:101: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘long unsigned int’ gcc -Wall -I. -I./portaudio -lpthread -o test-int-array-portaudio-nobarrier \ -DNO_MEMORY_BARRIER \ test-int-array.c portaudio/ringbuffer.c portaudio/pa_ringbuffer.c test-int-array.c: In function ‘main’: test-int-array.c:101: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘long unsigned int’ ./alltests.sh Starting ringbuffer tests (buffer size: 512) === Jack ringbuffer test === starting ringbuffer stress test (2 minutes max) buffer size (bytes): 512 array size (bytes): 256 reader started on cpu 0 writer started on cpu: 0 49536 != 49408 at offset 0 failure in chunk 330502 === Portaudio ringbuffer test === starting ringbuffer stress test (2 minutes max) buffer size (bytes): 512 array size (bytes): 256 reader started on cpu 0 writer started on cpu: 0 Success === Portaudio ringbuffer test (without memory barriers) === starting ringbuffer stress test (2 minutes max) buffer size (bytes): 512 array size (bytes): 256 reader started on cpu 0 writer started on cpu: 0 Success -- jörn nettingsmeier home://germany/45128 essen/lortzingstr. 11/ http://spunk.dnsalias.org phone://+49/201/491621 Kurt is up in Heaven now. ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
Olivier Guilyardi wrote: > Here's how to run the test: > > svn co http://svn.samalyse.com/misc/rbtest > cd rbtest > make test on a pentium-m 32bit notebook: [EMAIL PROTECTED]:/local/rbtest> make test gcc -Wall -I. -I./jack -lpthread -o test-int-array-jack \ test-int-array.c jack/ringbuffer.c gcc -Wall -I. -I./portaudio -lpthread -o test-int-array-portaudio \ test-int-array.c portaudio/ringbuffer.c portaudio/pa_ringbuffer.c gcc -Wall -I. -I./portaudio -lpthread -o test-int-array-portaudio-nobarrier \ -DNO_MEMORY_BARRIER \ test-int-array.c portaudio/ringbuffer.c portaudio/pa_ringbuffer.c ./alltests.sh Starting ringbuffer tests (buffer size: 512) === Jack ringbuffer test === starting ringbuffer stress test (2 minutes max) buffer size (bytes): 512 array size (bytes): 256 reader started on cpu 0 writer started on cpu: 0 Success === Portaudio ringbuffer test === starting ringbuffer stress test (2 minutes max) buffer size (bytes): 512 array size (bytes): 256 reader started on cpu 0 writer started on cpu: 0 Success === Portaudio ringbuffer test (without memory barriers) === starting ringbuffer stress test (2 minutes max) buffer size (bytes): 512 array size (bytes): 256 reader started on cpu 0 writer started on cpu: 0 Success [EMAIL PROTECTED]:/local/rbtest> -- jörn nettingsmeier home://germany/45128 essen/lortzingstr. 11/ http://spunk.dnsalias.org phone://+49/201/491621 Kurt is up in Heaven now. ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sat, Oct 18, 2008 at 5:30 PM, Fons Adriaensen <[EMAIL PROTECTED]> wrote: > On Sat, Oct 18, 2008 at 10:56:31PM +0200, Paul Davis wrote: > >> if read_ptr > size then the math should result in the write space being >> *under* estimated. if thats not happening then my worst nightmares come >> true, which has happened before. is there a signed/unsigned issue going >> on here? > > The only way to know is to verify all the possible cases in > jack_ringbuffer_write_space() and jack_ringbuffer_read_space(), > taking into account that the masking operation may not have > been applied at the time these are called, and that 'the other' > *_ptr could be >= size. > [Switched from LAU to LAD] Forget I said anything about context switches. Those don't matter. The concern here is SMP systems. The bug in the original code is += operator used to increment the read and write pointers. That operator generates the addl instruction, which is not atomic. It needs to be locked on SMP. The reason Olivier's patch works is because the increment is done on a temp and then assinged (using plain old =) to the shared variable. The assignment *is* atomic (since size_t is 4 bytes and presumably aligned to a 4-byte boundary). I can't prove this right now, but I think I'm correct. That means it may be possible to make the original code SMP-safe without the (subtle) change in semantics Olivier's patch makes. Like increment a temp, store it, mask the temp, store it again. You can't write lock-free data structures without atomic operations. While the x86 doesn't need any memory barriers (all stores are seen by all other CPUs), other platforms could (someone mentioned PowerPC). The code still needs to be patched for that; just make the x86 version no-ops. ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sat, Oct 18, 2008 at 10:22 PM, Paul Coccoli <[EMAIL PROTECTED]> wrote: > [Switched from LAU to LAD] > > Forget I said anything about context switches. Those don't matter. > The concern here is SMP systems. The bug in the original code is += > operator used to increment the read and write pointers. That operator > generates the addl instruction, which is not atomic. It needs to be > locked on SMP. The reason Olivier's patch works is because the > increment is done on a temp and then assinged (using plain old =) to > the shared variable. The assignment *is* atomic (since size_t is 4 > bytes and presumably aligned to a 4-byte boundary). > > I can't prove this right now, but I think I'm correct. That means it > may be possible to make the original code SMP-safe without the > (subtle) change in semantics Olivier's patch makes. Like increment a > temp, store it, mask the temp, store it again. > > You can't write lock-free data structures without atomic operations. This is wrong. For the single reader, single writer case, atomic operations are *not* necessary. The bug, as was already pointed out, is due to storing the unmasked pointer in the ringbuffer, allowing the other thread to see it in an invalid state. > While the x86 doesn't need any memory barriers (all stores are seen by > all other CPUs), other platforms could (someone mentioned PowerPC). > The code still needs to be patched for that; just make the x86 version > no-ops. This is correct. Some other architectures (PowerPC, etc.) *do* require the memory barrier. So, for portability it should be included. -- joq ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sat, Oct 18, 2008 at 11:29 PM, Jack O'Quin <[EMAIL PROTECTED]> wrote: > This is wrong. For the single reader, single writer case, atomic operations > are *not* necessary. The bug, as was already pointed out, is due to storing Let's agree to disagree, then. Single-reader, single-writer does not automatically make something SMP safe. There is large body of literature on lock-free data structures that agrees with me; someone posted a link to a collection of those earlier in the thread. > the unmasked pointer in the ringbuffer, allowing the other thread to see it > in an invalid state. Paul Davis disagrees, and I have yet to come up with a scenario where read_ptr can be assigned a value larger than size. And I'm the one who pointed out the bug in the first place. ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sat, Oct 18, 2008 at 10:45 PM, Paul Coccoli <[EMAIL PROTECTED]> wrote: > On Sat, Oct 18, 2008 at 11:29 PM, Jack O'Quin <[EMAIL PROTECTED]> wrote: >> This is wrong. For the single reader, single writer case, atomic operations >> are *not* necessary. The bug, as was already pointed out, is due to storing > > Let's agree to disagree, then. Single-reader, single-writer does not > automatically make something SMP safe. There is large body of > literature on lock-free data structures that agrees with me; someone > posted a link to a collection of those earlier in the thread. Let's not. This is not just a matter of opinion. If you read that literature, you will find that the ring buffer *is* safe for the single reader, single writer case. In many other SMP situations, atomic operations *are* required, but not for ring buffers. However, for non-x86 machines which do not provide strong cache coherency between processors (NUMA, PowerPC, etc.), I believe it *is* possible for the read thread to look at buffer locations whose contents have been written by the write thread, but those data may not yet be visible to the other processor. This is why memory barriers may be needed on some machines. For portability, we should include them in the code. On many machines, they will be no-ops, but that's OK. >> the unmasked pointer in the ringbuffer, allowing the other thread to see it >> in an invalid state. > > Paul Davis disagrees, and I have yet to come up with a scenario where > read_ptr can be assigned a value larger than size. And I'm the one > who pointed out the bug in the first place. If the amount read or written exactly fills the buffer, then a read or write pointer equal to the last entry plus one will be stored momentarily, before the correct (masked) value wraps it back to the beginning of the buffer. If the other thread looks at it in that state, I believe data will be copied outside the bounds of the buffer, which is bad. Note that this is a bug even in the uni-processor case. SMP merely makes it more likely. -- joq ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sat, 2008-10-18 at 23:24 -0500, Jack O'Quin wrote: > If the amount read or written exactly fills the buffer, then a read or write > pointer equal to the last entry plus one will be stored momentarily, before > the correct (masked) value wraps it back to the beginning of the buffer. If > the other thread looks at it in that state, I believe data will be > copied outside > the bounds of the buffer, which is bad. this is indeed the case. for read space, if the write ptr is "ahead" of the read ptr, we return write_ptr - read_ptr, which will return a value that is too large if "write_ptr" gets snapshotted after the increment and before the mask. the same is true for writing. if the read ptr is ahead of the write ptr, we return the difference between them minus one (the extra -1 stops the write ptr catching up with the read ptr, a condition that indicates that the buffer is empty). this value will also be too large if the read_ptr is used in its "intermediate" state. i am deeply embarrassed but also puzzled. i remember worrying about precisely this issue on and over across a period of several years, and yet satisfying myself that there actually wasn't a problem. i think that my mistake was as follows: if you look at the read space case, but focus on whether the *read* ptr is too large at that time, then the safety of the ringbuffer is still guaranteed. ditto, in the write space case, the write ptr being too large doesn't violate the safety guarantee. but this is backwards - the thread that computes write space is the writing thread, which is the only one that modifies write_ptr. ergo, it never sees write_ptr as volatile. ditto, the reading thread never sees read_ptr as volatile. the problems arise because of the *other* pointer, where the temporary intermediate state *does* cause the computations to be in error. i don't know whether to shoot myself or eat another couple of the oft-promised hats. so the next question is how best to prevent it. as far as i can see we have a couple of proposals: 1) fons' design, which never actually wraps readptr or writeptr, but masks the address used to access the data buffer 2) removing the intermediate state's visibility i admit to preferring (2) even though i know that with a 64 bit index, not wrapping the ptrs is not really a problem. however, it is not totally clear to me how to prevent an optimizing compiler from doing the wrong thing here. unlike the claims made by someone involved with portaudio, i believe that it is correct to declare the read_ptr and write_ptr volatile, so that the compiler knows that it cannot try to be clever about it accesses of the "other" ptr (i.e. read_ptr for the writing thread, and vice versa). maybe the comment on the use of volatile was based on some idea that i thought it made the variables thread safe, which is and never was the case. i suspect that the safest code looks like: size_t tmp = (ptr + incr) & mask; barrier(); ptr = tmp; but i am not sure whether barrier needs to be read, write or both. i think that the simpler code: ptr = (ptr + incr) & mask; is subject to potential compiler and/or processor "optimization" that might reduce it back to the problem case of two ops without an intermediate load/store location. the volatile declaration ought to prevent the compiler from doing this, and i don't see why a processor would do this, ever, but clearly i've already been deeply wrong about this. does anybody know for certain? --p ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sun, 2008-10-19 at 09:55 +0200, Paul Davis wrote: > i am deeply embarrassed but also puzzled. maybe i don't need to eat any hats except for the "paying insufficient attention" one. the jack ringbuffer code was ported from my original C++ implementation, which specifically does not have the problem of visible intermediate state (and, btw, does force the use of atomic loads & stores). --p ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sun, Oct 19, 2008 at 09:55:54AM +0200, Paul Davis wrote: >1) fons' design, which never actually wraps readptr or writeptr, but > masks the address used to access the data buffer > >2) removing the intermediate state's visibility > > i admit to preferring (2) even though i know that with a 64 bit index, > not wrapping the ptrs is not really a problem. Another advantage of 1 is that it has no problem with a completely full buffer, and this doesn't even require any special code. If access is always in 2^N blocks, it can store one block more. > i think that the simpler code: > >ptr = (ptr + incr) & mask; > > is subject to potential compiler and/or processor "optimization" that > might reduce it back to the problem case of two ops without an > intermediate load/store location. You mean _with_ and intermediate store (I hope) ? Adding a intermediate store would not really be an 'optimisation'... I doubt if any compiler would do it. But you can prevent it from doing so, see below. > the volatile declaration ought to prevent the compiler from > doing this It does. Declaring a var volatile means that the compiler MUST assume that reading or writing it can have side effects, and that therefore it MUST access the var exactly as specified. For the new code this means that only the correct value will ever be written, for the old it forced the compiler to write the wrong one. Ciao, -- FA Laboratorio di Acustica ed Elettroacustica Parma, Italia Lascia la spina, cogli la rosa. ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sun, Oct 19, 2008 at 2:55 AM, Paul Davis <[EMAIL PROTECTED]> wrote: > i don't know whether to shoot myself or eat another couple of the > oft-promised hats. Don't beat yourself up too badly. Multiple threads accessing shared memory is *very* tricky. Even smart people (like you) get it wrong sometimes. > so the next question is how best to prevent it. as far as i can see we > have a couple of proposals: > > 1) fons' design, which never actually wraps readptr or writeptr, but > masks the address used to access the data buffer > > 2) removing the intermediate state's visibility > > i admit to preferring (2) even though i know that with a 64 bit index, > not wrapping the ptrs is not really a problem. I also prefer (2). > however, it is not totally clear to me how to prevent an optimizing > compiler from doing the wrong thing here. unlike the claims made by > someone involved with portaudio, i believe that it is correct to declare > the read_ptr and write_ptr volatile, so that the compiler knows that it > cannot try to be clever about it accesses of the "other" ptr (i.e. > read_ptr for the writing thread, and vice versa). maybe the comment on > the use of volatile was based on some idea that i thought it made the > variables thread safe, which is and never was the case. > > i suspect that the safest code looks like: > >size_t tmp = (ptr + incr) & mask; >barrier(); >ptr = tmp; > > but i am not sure whether barrier needs to be read, write or both. > > i think that the simpler code: > > ptr = (ptr + incr) & mask; > > is subject to potential compiler and/or processor "optimization" that > might reduce it back to the problem case of two ops without an > intermediate load/store location. the volatile declaration ought to > prevent the compiler from doing this, and i don't see why a processor > would do this, ever, but clearly i've already been deeply wrong about > this. does anybody know for certain? It is best to avoid assumptions about what some future compiler may consider an "optimization". If the register pressure is high at some point in the program, it may decide to store some value just to free up register space for other variables. The "volatile" declaration should remove any need for the compiler barrier() statement, AFAICT. Note that barrier() is a compiler directive, and has no effect on the CPU's ability to reorder cache operations in an SMP memory hierarchy. Here is a fairly clear and complete description of the memory barrier issues: http://lxr.linux.no/linux/Documentation/memory-barriers.txt As best I can tell, both ring buffer threads require "general" memory barriers, because they both read and write (different) shared data. In Linux kernel terms, they'd need to use smp_mb(), since the problems they address are multiprocessor-only. But, since JACK is not compiled differently for UP and SMP, the full mb() seems more appropriate. That stuff makes my head hurt. -- joq ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sun, Oct 19, 2008 at 12:24 AM, Jack O'Quin <[EMAIL PROTECTED]> wrote: > On Sat, Oct 18, 2008 at 10:45 PM, Paul Coccoli <[EMAIL PROTECTED]> wrote: >> On Sat, Oct 18, 2008 at 11:29 PM, Jack O'Quin <[EMAIL PROTECTED]> wrote: >>> This is wrong. For the single reader, single writer case, atomic operations >>> are *not* necessary. The bug, as was already pointed out, is due to storing >> >> Let's agree to disagree, then. Single-reader, single-writer does not >> automatically make something SMP safe. There is large body of >> literature on lock-free data structures that agrees with me; someone >> posted a link to a collection of those earlier in the thread. > > Let's not. This is not just a matter of opinion. If you read that > literature, > you will find that the ring buffer *is* safe for the single reader, > single writer > case. In many other SMP situations, atomic operations *are* required, > but not for ring buffers. The only time you can get away without atomic ops is on uni-processor. Please cite a reference that says otherwise. Notice that all the fixes proposed all involve removing the "+=" and using only assignment. ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sun, Oct 19, 2008 at 12:36:55PM -0400, Paul Coccoli wrote: > The only time you can get away without atomic ops is on uni-processor. > Please cite a reference that says otherwise. Plaese show us how using a non-atomic addition could go wrong. > Notice that all the fixes proposed all involve removing the "+=" and > using only assignment. First, that is not the same as requiring atomic addition. Second, it is not true. The one I proposed just does the '+='. If it's wrong show us how. Ciao, -- FA Laboratorio di Acustica ed Elettroacustica Parma, Italia Lascia la spina, cogli la rosa. ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Sun, Oct 19, 2008 at 1:08 PM, Fons Adriaensen <[EMAIL PROTECTED]> wrote: > On Sun, Oct 19, 2008 at 12:36:55PM -0400, Paul Coccoli wrote: > >> The only time you can get away without atomic ops is on uni-processor. >> Please cite a reference that says otherwise. > > Plaese show us how using a non-atomic addition could go wrong. On a side note, does anyone know what the performance penalty is (if any) for using atomic ops? And does it scale according to number of CPU cores? What other factors are there? I assume the caching architecture makes a big difference. I've been using the atomic-ops library from HP for doing various things lately, and I find it quite nice. But I have found myself wondering whether I am paying some kind of penalty. Of course, I'm sure it's less than the penalty for using locks. Steve ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
Jack O'Quin wrote: > On Sun, Oct 19, 2008 at 2:55 AM, Paul Davis <[EMAIL PROTECTED]> wrote: > >> i don't know whether to shoot myself or eat another couple of the >> oft-promised hats. > > Don't beat yourself up too badly. Multiple threads accessing shared memory > is *very* tricky. Even smart people (like you) get it wrong sometimes. I agree with Jack. I remember a friend of mine once qualifying multi-threading of "kernel quantization". This is complex but also challenging :) Although your understanding of these matters is far beyond mine, I think that I have a good professional experience of coding methodology. And to me the problem here is indeed methodological: there is a lack of unit tests in JACK. >> so the next question is how best to prevent it. as far as i can see we >> have a couple of proposals: >> >> 1) fons' design, which never actually wraps readptr or writeptr, but >> masks the address used to access the data buffer >> >> 2) removing the intermediate state's visibility >> >> i admit to preferring (2) even though i know that with a 64 bit index, >> not wrapping the ptrs is not really a problem. > > I also prefer (2). > >> however, it is not totally clear to me how to prevent an optimizing >> compiler from doing the wrong thing here. unlike the claims made by >> someone involved with portaudio, i believe that it is correct to declare >> the read_ptr and write_ptr volatile, so that the compiler knows that it >> cannot try to be clever about it accesses of the "other" ptr (i.e. >> read_ptr for the writing thread, and vice versa). maybe the comment on >> the use of volatile was based on some idea that i thought it made the >> variables thread safe, which is and never was the case. >> >> i suspect that the safest code looks like: >> >>size_t tmp = (ptr + incr) & mask; >>barrier(); >>ptr = tmp; >> >> but i am not sure whether barrier needs to be read, write or both. >> >> i think that the simpler code: >> >> ptr = (ptr + incr) & mask; >> >> is subject to potential compiler and/or processor "optimization" that >> might reduce it back to the problem case of two ops without an >> intermediate load/store location. the volatile declaration ought to >> prevent the compiler from doing this, and i don't see why a processor >> would do this, ever, but clearly i've already been deeply wrong about >> this. does anybody know for certain? > > It is best to avoid assumptions about what some future compiler may > consider an "optimization". If the register pressure is high at some > point in the program, it may decide to store some value just to free up > register space for other variables. > > The "volatile" declaration should remove any need for the compiler > barrier() statement, AFAICT. Note that barrier() is a compiler > directive, and has no effect on the CPU's ability to reorder cache > operations in an SMP memory hierarchy. > > Here is a fairly clear and complete description of the memory barrier > issues: http://lxr.linux.no/linux/Documentation/memory-barriers.txt > > As best I can tell, both ring buffer threads require "general" memory > barriers, because they both read and write (different) shared data. In > Linux kernel terms, they'd need to use smp_mb(), since the problems > they address are multiprocessor-only. But, since JACK is not compiled > differently for UP and SMP, the full mb() seems more appropriate. > > That stuff makes my head hurt. Maybe that is because your approach is too theoretical. I've read many theories about whether memory barriers are needed or not in lock-free ring buffers, and I think it might lead nowhere without experimenting, that is: write test cases. Until now, I just couldn't write a test that fail because of the lack of memory barrier on x86. However, I think I found another bug in Jack ringbuffer, by writing another test. It's a bit of a weird test, I call it "bit circle": I start 16 "node" threads. Each node : - communicate with its adjacent node through 2 ring buffers - is responsible for shifting an integer by one bit - send the shifted result to the next node through a ring buffer - checks that the value read from previous node is correct. On my Debian Quad Core and Mac OS X Core Duo boxes, this test fails with Jack's ringbuffer even with my patch applied. But it succeeds with Portaudio (both with and without memory barriers), and also with Fons' implementation. I wish someone could eventually run my test suite on a PowerPC, especially SMP. The usage is unchanged: svn co http://svn.samalyse.com/misc/rbtest cd rbtest make test The run time is now shorter, about 2 minutes. Below is the output. You can see that jack (test-bit-circle-jack) fails the bit circle test, even when patched (test-bit-circle-jack-fix1). "|" is printed when a node thread starts, and "-" when a node has read 1000 values, to ensure data is really flowing. The test-int-array-* tests are the same as my original test. All *-lfq tests
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
Hi Olivier, On Oct 20, 2008, at 6:13 AM, Olivier Guilyardi wrote: > Until now, I just couldn't write a test that fail because of the > lack of memory > barrier on x86. However, I think I found another bug in Jack > ringbuffer, by > writing another test. > > It's a bit of a weird test, I call it "bit circle": I start 16 > "node" threads. > Each node : > - communicate with its adjacent node through 2 ring buffers > - is responsible for shifting an integer by one bit > - send the shifted result to the next node through a ring buffer > - checks that the value read from previous node is correct. > > On my Debian Quad Core and Mac OS X Core Duo boxes, this test fails > with Jack's > ringbuffer even with my patch applied. But it succeeds with > Portaudio (both with > and without memory barriers), and also with Fons' implementation. > > I wish someone could eventually run my test suite on a PowerPC, > especially SMP. > > The usage is unchanged: > svn co http://svn.samalyse.com/misc/rbtest > cd rbtest > make test > > The run time is now shorter, about 2 minutes. Below is the output. > You can see > that jack (test-bit-circle-jack) fails the bit circle test, even > when patched > (test-bit-circle-jack-fix1). "|" is printed when a node thread > starts, and "-" > when a node has read 1000 values, to ensure data is really flowing. > > The test-int-array-* tests are the same as my original test. All *- > lfq tests use > Fons's Lfq ringbuffer (modified to use char instead of uint32_t) as > backend. > > ./run-tests.sh bit-circle 512 jack jack-fix1 portaudio portaudio- > nobarrier lfq > > test-bit-circle-jack - starting (5s max) - buffer size: 512 > || FAILURE: 2 != 514 I thought it suspicious that 514 = 2 + 2 * 256, while on a PPC it returns FAILURE: 2 != 0. Endianness bug? No, but it's a clue... I don't think you're seeing a *bug* in the jack ringbuffer, just a difference in semantics between it and the PortAudio and lfq imlementations. Both PA and lfq mask the index after reading it, so they can completely fill the buffer. The jack implementation masks the stored index, which means it can't ever completely fill the buffer, or there would be an ambiguity as to whether it was full or empty. The most it can store is (size - 1) bytes. In the jack version of your code, at some point, the buffer almost fills, and jack_ringbuffer_write only writes a single byte to the buffer (low order on x86, high order on PPC). Next time around, it does this again. Then the following thread reads, not 0x0002 like it should, but either 0x0202 (x86) or 0x (PPC). Paul said he prefers storing the masked pointer, but I think Fons & PA's approach is better: it allows for full use of the buffer. -Sean ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
Sean Bolton wrote: > Hi Olivier, > > On Oct 20, 2008, at 6:13 AM, Olivier Guilyardi wrote: >> Until now, I just couldn't write a test that fail because of the >> lack of memory >> barrier on x86. However, I think I found another bug in Jack >> ringbuffer, by >> writing another test. >> >> It's a bit of a weird test, I call it "bit circle": I start 16 >> "node" threads. >> Each node : >> - communicate with its adjacent node through 2 ring buffers >> - is responsible for shifting an integer by one bit >> - send the shifted result to the next node through a ring buffer >> - checks that the value read from previous node is correct. >> >> On my Debian Quad Core and Mac OS X Core Duo boxes, this test fails >> with Jack's >> ringbuffer even with my patch applied. But it succeeds with >> Portaudio (both with >> and without memory barriers), and also with Fons' implementation. >> >> I wish someone could eventually run my test suite on a PowerPC, >> especially SMP. >> >> The usage is unchanged: >> svn co http://svn.samalyse.com/misc/rbtest >> cd rbtest >> make test >> >> The run time is now shorter, about 2 minutes. Below is the output. >> You can see >> that jack (test-bit-circle-jack) fails the bit circle test, even >> when patched >> (test-bit-circle-jack-fix1). "|" is printed when a node thread >> starts, and "-" >> when a node has read 1000 values, to ensure data is really flowing. >> >> The test-int-array-* tests are the same as my original test. All *- >> lfq tests use >> Fons's Lfq ringbuffer (modified to use char instead of uint32_t) as >> backend. >> >> ./run-tests.sh bit-circle 512 jack jack-fix1 portaudio portaudio- >> nobarrier lfq >> >> test-bit-circle-jack - starting (5s max) - buffer size: 512 >> || FAILURE: 2 != 514 > > I thought it suspicious that 514 = 2 + 2 * 256, while on a PPC it > returns > FAILURE: 2 != 0. Endianness bug? No, but it's a clue... > > I don't think you're seeing a *bug* in the jack ringbuffer, just a > difference in > semantics between it and the PortAudio and lfq imlementations. Both PA > and lfq mask the index after reading it, so they can completely fill > the buffer. > The jack implementation masks the stored index, which means it can't > ever completely fill the buffer, or there would be an ambiguity as to > whether > it was full or empty. The most it can store is (size - 1) bytes. > > In the jack version of your code, at some point, the buffer almost > fills, and > jack_ringbuffer_write only writes a single byte to the buffer (low > order on x86, > high order on PPC). Next time around, it does this again. Then the > following > thread reads, not 0x0002 like it should, but either 0x0202 (x86) or > 0x (PPC). > I can confirm this, since the following will make the test 'succeed', although it results in a very slow progress once the buffers are full. Index: test-bit-circle.c === --- test-bit-circle.c (revision 314) +++ test-bit-circle.c (working copy) @@ -50,6 +50,9 @@ if (value) { + while(jack_ringbuffer_write_space(rb[1]) < 2) { +usleep(1000); + } jack_ringbuffer_write (rb[1], (char *) &value, sizeof (uint16_t)); value = 0; } Pieter ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
Hi, Pieter Palmers a écrit : > Sean Bolton wrote: >> >>> test-bit-circle-jack - starting (5s max) - buffer size: 512 >>> || FAILURE: 2 != 514 >> >> I thought it suspicious that 514 = 2 + 2 * 256, while on a PPC it >> returns >> FAILURE: 2 != 0. Endianness bug? No, but it's a clue... Do you have a PPC? Which one? Can you please paste the whole output of "make test"? >> I don't think you're seeing a *bug* in the jack ringbuffer, just a >> difference in >> semantics between it and the PortAudio and lfq imlementations. Both PA >> and lfq mask the index after reading it, so they can completely fill >> the buffer. >> The jack implementation masks the stored index, which means it can't >> ever completely fill the buffer, or there would be an ambiguity as to >> whether >> it was full or empty. The most it can store is (size - 1) bytes. >> >> In the jack version of your code, at some point, the buffer almost >> fills, and >> jack_ringbuffer_write only writes a single byte to the buffer (low >> order on x86, >> high order on PPC). Next time around, it does this again. Then the >> following >> thread reads, not 0x0002 like it should, but either 0x0202 (x86) or >> 0x (PPC). >> > > I can confirm this, since the following will make the test 'succeed', > although it results in a very slow progress once the buffers are full. > > Index: test-bit-circle.c > === > --- test-bit-circle.c (revision 314) > +++ test-bit-circle.c (working copy) > @@ -50,6 +50,9 @@ > > if (value) > { > + while(jack_ringbuffer_write_space(rb[1]) < 2) { > +usleep(1000); > + } >jack_ringbuffer_write (rb[1], (char *) &value, sizeof (uint16_t)); >value = 0; > } I know my code performs much more write than read operations, I did that on purpose, while playing with it. Now, back to reality, let's say that you have a thread that synthetizes some sound, puts it into a ringbuffer, for an audio thread to read and play it. Let's also say that the synth thread performs much faster than the audio thread, so that the buffer is almost always full. I think this is quite realistic. In these conditions, if I understand correctly the result of this test, there is a chance for the output of the ringbuffer to differ from its input. Isn't that some sort of bug? -- Olivier Guilyardi / Samalyse ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
Olivier Guilyardi a écrit : > Hi, > > Pieter Palmers a écrit : >> Sean Bolton wrote: >>> test-bit-circle-jack - starting (5s max) - buffer size: 512 || FAILURE: 2 != 514 >>> >>> I thought it suspicious that 514 = 2 + 2 * 256, while on a PPC it >>> returns >>> FAILURE: 2 != 0. Endianness bug? No, but it's a clue... > > Do you have a PPC? Which one? Can you please paste the whole output of > "make test"? > >>> I don't think you're seeing a *bug* in the jack ringbuffer, just a >>> difference in >>> semantics between it and the PortAudio and lfq imlementations. Both PA >>> and lfq mask the index after reading it, so they can completely fill >>> the buffer. >>> The jack implementation masks the stored index, which means it can't >>> ever completely fill the buffer, or there would be an ambiguity as >>> to whether >>> it was full or empty. The most it can store is (size - 1) bytes. >>> >>> In the jack version of your code, at some point, the buffer almost >>> fills, and >>> jack_ringbuffer_write only writes a single byte to the buffer (low >>> order on x86, >>> high order on PPC). Next time around, it does this again. Then the >>> following >>> thread reads, not 0x0002 like it should, but either 0x0202 (x86) or >>> 0x (PPC). >>> >> >> I can confirm this, since the following will make the test 'succeed', >> although it results in a very slow progress once the buffers are full. >> >> Index: test-bit-circle.c >> === >> --- test-bit-circle.c (revision 314) >> +++ test-bit-circle.c (working copy) >> @@ -50,6 +50,9 @@ >> >> if (value) >> { >> + while(jack_ringbuffer_write_space(rb[1]) < 2) { >> +usleep(1000); >> + } >>jack_ringbuffer_write (rb[1], (char *) &value, sizeof (uint16_t)); >>value = 0; >> } > > I know my code performs much more write than read operations, I did that > on purpose, while playing with it. > > Now, back to reality, let's say that you have a thread that synthetizes > some sound, puts it into a ringbuffer, for an audio thread to read and > play it. Let's also say that the synth thread performs much faster than > the audio thread, so that the buffer is almost always full. I think this > is quite realistic. > > In these conditions, if I understand correctly the result of this test, > there is a chance for the output of the ringbuffer to differ from its > input. Isn't that some sort of bug? > Hmm, wait, I think I understand your point.. With the Jack implementation, in a buffer of 16 bytes, you can really put 15 bytes. I didn't notice this. And since I do not check if there's enough space to write my 16-bits integer, when the buffer is almost full, it only writes one byte... So, well, yes it looks like this is no bug... Sorry about confusion. I've fixed this (r315). This test now indeed passes on all backends. What about the test-int-array-* family of tests on PPC? -- Olivier Guilyardi / Samalys ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Oct 20, 2008, at 3:18 PM, Olivier Guilyardi wrote: > What about the test-int-array-* family of tests on PPC? They all pass on my uniprocessor G4, but that doesn't really tell us much. Anybody got a multiprocessor PPC? This stuff makes my head hurt, too, but as best I can figure out, I believe the jack code needs memory barriers on several non-x86 architectures with weak memory ordering, PPC probably being the most common. It appears to me that without barriers, there's a small chance reordered writes on one processor could result in another processor reading invalid data: CPU1: CPU2: write index --- write data2 read index write data3 read data1 <- invalid write data1 read data2 --- read data3 x Paul E. McKenney Memory Ordering in Modern Microprocessors, Part I http://www.linuxjournal.com/article/8211 Memory Ordering in Modern Microprocessors, Part II http://www.linuxjournal.com/article/8212 Linus Torvalds Re: Memory barriers and spin_unlock safety http://lkml.org/lkml/2006/3/3/142 Parts of this *thread* Re: Re: [liblf-dev] Possible lock-free ring buffer: msg#00152 http://techweb.rfa.org/pipermail/portaudio/2006-May/005661.html ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
Sean Bolton wrote: > On Oct 20, 2008, at 3:18 PM, Olivier Guilyardi wrote: >> What about the test-int-array-* family of tests on PPC? > > They all pass on my uniprocessor G4, but that doesn't > really tell us much. Anybody got a multiprocessor PPC? I think it might be worth to try the following on your G4: - run the test *much* longer (change the sleep call at the end of test-int-array.c). Having experimented a lot with my tests, I know that failures appear after a truly random amount of time. - induce entropy, move your mouse, compose music, make a coffee, kick your computer... ;-) - experiment with compiler optimization (-O1, -O2 or -O3, ... in the Makefile's CFLAGS). I'm unsure about this, but it's been mentioned many times on this thread and I think it's worth trying. PS: I'm CC-ing this message to jack-devel as Joern (a great community engineer I know ;) recommended. -- Olivier Guilyardi / Samalyse ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
For the records, here is the output from a dual IBM Cell system (QS-21), each PowerPC unit is exposed with two cores. jlc $ cat /proc/cpuinfo processor : 0 cpu : Cell Broadband Engine, altivec supported clock : 3200.00MHz revision: 5.1 (pvr 0070 0501) processor : 1 cpu : Cell Broadband Engine, altivec supported clock : 3200.00MHz revision: 5.1 (pvr 0070 0501) processor : 2 cpu : Cell Broadband Engine, altivec supported clock : 3200.00MHz revision: 5.1 (pvr 0070 0501) processor : 3 cpu : Cell Broadband Engine, altivec supported clock : 3200.00MHz revision: 5.1 (pvr 0070 0501) timebase: 14318000 platform: Cell machine : CHRP MCS,-n/a = ./run-tests.sh bit-circle 512 jack jack-fix1 portaudio portaudio-nobarrier lfq test-bit-circle-jack - starting (5s max) - buffer size: 512 ||-||--- SUCCESS test-bit-circle-jack-fix1 - starting (5s max) - buffer size: 512 ---- SUCCESS test-bit-circle-portaudio - starting (5s max) - buffer size: 512 ||-|||-||--| SUCCESS test-bit-circle-portaudio-nobarrier - starting (5s max) - buffer size: 512 -|-|-||-||--||-- SUCCESS test-bit-circle-lfq - starting (5s max) - buffer size: 512 ||-|||-||-|- SUCCESS ./run-tests.sh int-array 16 jack jack-fix1 portaudio portaudio-nobarrier lfq test-int-array-jack - starting (120s max) - array/buffer size: 8/16 [reader started] [writer started] [flowing] 29364 != 29360 at offset 0 FAILURE in chunk 702810 test-int-array-jack-fix1 - starting (120s max) - array/buffer size: 8/16 [reader started] [writer started] [flowing] SUCCESS test-int-array-portaudio - starting (120s max) - array/buffer size: 8/16 [reader started] [writer started] [flowing] SUCCESS test-int-array-portaudio-nobarrier - starting (120s max) - array/buffer size: 8/16 [reader started] [writer started] [flowing] SUCCESS test-int-array-lfq - starting (120s max) - array/buffer size: 8/16 [reader started] [writer started] [flowing] SUCCESS ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?
On Mon, Oct 27, 2008 at 4:38 PM, Jesse Chappell <[EMAIL PROTECTED]> wrote: > For the records, here is the output from a dual IBM Cell system > (QS-21), each PowerPC unit is exposed with two cores. This is actually a QS-20, sorry. jlc ___ Linux-audio-dev mailing list Linux-audio-dev@lists.linuxaudio.org http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev