Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-17 Thread Joern Nettingsmeier
Olivier Guilyardi wrote:

> To figure whether memory barriers are needed or not, we need to run the test 
> on
> various architectures.
> 
> For this purpose, I've set up a small svn repo with everything to run the test
> easily. There's no dependency, you just need pthread, gcc and sh.
> 
> Here's how to run the test:
> 
> svn co http://svn.samalyse.com/misc/rbtest
> cd rbtest
> make test

i don't really understand the finer points of this discussion, but hey
it's always great to join a technical thread. my favourite bikeshed
color is yellow :-D

ok, here comes the result for a single-cpu athlon64 4000+ in 64bit mode.

[EMAIL PROTECTED]:/build/rbtest> make test
gcc -Wall -I. -I./jack -lpthread -o test-int-array-jack \
test-int-array.c jack/ringbuffer.c
test-int-array.c: In function ‘main’:
test-int-array.c:101: warning: format ‘%d’ expects type ‘int’, but
argument 2 has type ‘long unsigned int’
gcc -Wall -I. -I./portaudio -lpthread -o test-int-array-portaudio \
test-int-array.c portaudio/ringbuffer.c
portaudio/pa_ringbuffer.c
test-int-array.c: In function ‘main’:
test-int-array.c:101: warning: format ‘%d’ expects type ‘int’, but
argument 2 has type ‘long unsigned int’
gcc -Wall -I. -I./portaudio -lpthread -o
test-int-array-portaudio-nobarrier \
-DNO_MEMORY_BARRIER \
test-int-array.c portaudio/ringbuffer.c
portaudio/pa_ringbuffer.c
test-int-array.c: In function ‘main’:
test-int-array.c:101: warning: format ‘%d’ expects type ‘int’, but
argument 2 has type ‘long unsigned int’
./alltests.sh
Starting ringbuffer tests (buffer size: 512)

=== Jack ringbuffer test ===
starting ringbuffer stress test (2 minutes max)
buffer size (bytes): 512
array size (bytes): 256
reader started on cpu 0
writer started on cpu: 0
49536 != 49408 at offset 0
failure in chunk 330502

=== Portaudio ringbuffer test ===
starting ringbuffer stress test (2 minutes max)
buffer size (bytes): 512
array size (bytes): 256
reader started on cpu 0
writer started on cpu: 0
Success

=== Portaudio ringbuffer test (without memory barriers) ===
starting ringbuffer stress test (2 minutes max)
buffer size (bytes): 512
array size (bytes): 256
reader started on cpu 0
writer started on cpu: 0
Success


-- 
jörn nettingsmeier

home://germany/45128 essen/lortzingstr. 11/
http://spunk.dnsalias.org
phone://+49/201/491621

Kurt is up in Heaven now.
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-17 Thread Joern Nettingsmeier
Olivier Guilyardi wrote:


> Here's how to run the test:
> 
> svn co http://svn.samalyse.com/misc/rbtest
> cd rbtest
> make test

on a pentium-m 32bit notebook:

[EMAIL PROTECTED]:/local/rbtest> make test
gcc -Wall -I. -I./jack -lpthread -o test-int-array-jack \
test-int-array.c jack/ringbuffer.c
gcc -Wall -I. -I./portaudio -lpthread -o test-int-array-portaudio \
test-int-array.c portaudio/ringbuffer.c
portaudio/pa_ringbuffer.c
gcc -Wall -I. -I./portaudio -lpthread -o
test-int-array-portaudio-nobarrier \
-DNO_MEMORY_BARRIER \
test-int-array.c portaudio/ringbuffer.c
portaudio/pa_ringbuffer.c
./alltests.sh
Starting ringbuffer tests (buffer size: 512)

=== Jack ringbuffer test ===
starting ringbuffer stress test (2 minutes max)
buffer size (bytes): 512
array size (bytes): 256
reader started on cpu 0
writer started on cpu: 0
Success

=== Portaudio ringbuffer test ===
starting ringbuffer stress test (2 minutes max)
buffer size (bytes): 512
array size (bytes): 256
reader started on cpu 0
writer started on cpu: 0
Success

=== Portaudio ringbuffer test (without memory barriers) ===
starting ringbuffer stress test (2 minutes max)
buffer size (bytes): 512
array size (bytes): 256
reader started on cpu 0
writer started on cpu: 0
Success
[EMAIL PROTECTED]:/local/rbtest>

-- 
jörn nettingsmeier

home://germany/45128 essen/lortzingstr. 11/
http://spunk.dnsalias.org
phone://+49/201/491621

Kurt is up in Heaven now.
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-18 Thread Paul Coccoli
On Sat, Oct 18, 2008 at 5:30 PM, Fons Adriaensen <[EMAIL PROTECTED]> wrote:
> On Sat, Oct 18, 2008 at 10:56:31PM +0200, Paul Davis wrote:
>
>> if read_ptr > size then the math should result in the write space being
>> *under* estimated. if thats not happening then my worst nightmares come
>> true, which has happened before. is there a signed/unsigned issue going
>> on here?
>
> The only way to know is to verify all the possible cases in
> jack_ringbuffer_write_space() and jack_ringbuffer_read_space(),
> taking into account that the masking operation may not have
> been applied at the time these are called, and that 'the other'
> *_ptr could be >= size.
>

[Switched from LAU to LAD]

Forget I said anything about context switches.  Those don't matter.
The concern here is SMP systems.  The bug in the original code is +=
operator used to increment the read and write pointers.  That operator
generates the addl instruction, which is not atomic.  It needs to be
locked on SMP.  The reason Olivier's patch works is because the
increment is done on a temp and then assinged (using plain old =) to
the shared variable.  The assignment *is* atomic (since size_t is 4
bytes and presumably aligned to a 4-byte boundary).

I can't prove this right now, but I think I'm correct.  That means it
may be possible to make the original code SMP-safe without the
(subtle) change in semantics Olivier's patch makes.  Like increment a
temp, store it, mask the temp, store it again.

You can't write lock-free data structures without atomic operations.

While the x86 doesn't need any memory barriers (all stores are seen by
all other CPUs), other platforms could (someone mentioned PowerPC).
The code still needs to be patched for that; just make the x86 version
no-ops.
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-18 Thread Jack O'Quin
On Sat, Oct 18, 2008 at 10:22 PM, Paul Coccoli <[EMAIL PROTECTED]> wrote:

> [Switched from LAU to LAD]
>
> Forget I said anything about context switches.  Those don't matter.
> The concern here is SMP systems.  The bug in the original code is +=
> operator used to increment the read and write pointers.  That operator
> generates the addl instruction, which is not atomic.  It needs to be
> locked on SMP.  The reason Olivier's patch works is because the
> increment is done on a temp and then assinged (using plain old =) to
> the shared variable.  The assignment *is* atomic (since size_t is 4
> bytes and presumably aligned to a 4-byte boundary).
>
> I can't prove this right now, but I think I'm correct.  That means it
> may be possible to make the original code SMP-safe without the
> (subtle) change in semantics Olivier's patch makes.  Like increment a
> temp, store it, mask the temp, store it again.
>
> You can't write lock-free data structures without atomic operations.

This is wrong.  For the single reader, single writer case, atomic operations
are *not* necessary.  The bug, as was already pointed out, is due to storing
the unmasked pointer in the ringbuffer, allowing the other thread to see it
in an invalid state.

> While the x86 doesn't need any memory barriers (all stores are seen by
> all other CPUs), other platforms could (someone mentioned PowerPC).
> The code still needs to be patched for that; just make the x86 version
> no-ops.

This is correct.  Some other architectures (PowerPC, etc.) *do*
require the memory barrier.  So, for portability it should be included.

-- 
 joq
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-18 Thread Paul Coccoli
On Sat, Oct 18, 2008 at 11:29 PM, Jack O'Quin <[EMAIL PROTECTED]> wrote:
> This is wrong.  For the single reader, single writer case, atomic operations
> are *not* necessary.  The bug, as was already pointed out, is due to storing

Let's agree to disagree, then.  Single-reader, single-writer does not
automatically make something SMP safe.  There is large body of
literature on lock-free data structures that agrees with me; someone
posted a link to a collection of those earlier in the thread.

> the unmasked pointer in the ringbuffer, allowing the other thread to see it
> in an invalid state.

Paul Davis disagrees, and I have yet to come up with a scenario where
read_ptr can be assigned a value larger than size.  And I'm the one
who pointed out the bug in the first place.
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-18 Thread Jack O'Quin
On Sat, Oct 18, 2008 at 10:45 PM, Paul Coccoli <[EMAIL PROTECTED]> wrote:
> On Sat, Oct 18, 2008 at 11:29 PM, Jack O'Quin <[EMAIL PROTECTED]> wrote:
>> This is wrong.  For the single reader, single writer case, atomic operations
>> are *not* necessary.  The bug, as was already pointed out, is due to storing
>
> Let's agree to disagree, then.  Single-reader, single-writer does not
> automatically make something SMP safe.  There is large body of
> literature on lock-free data structures that agrees with me; someone
> posted a link to a collection of those earlier in the thread.

Let's not.  This is not just a matter of opinion.  If you read that literature,
you will find that the ring buffer *is* safe for the single reader,
single writer
case.  In many other SMP situations, atomic operations *are* required,
but not for ring buffers.

However, for non-x86 machines which do not provide strong cache
coherency between processors (NUMA, PowerPC, etc.), I believe it *is*
possible for the read thread to look at buffer locations whose contents have
been written by the write thread, but those data may not yet be visible to
the other processor.  This is why memory barriers may be needed on some
machines.  For portability, we should include them in the code.  On many
machines, they will be no-ops, but that's OK.

>> the unmasked pointer in the ringbuffer, allowing the other thread to see it
>> in an invalid state.
>
> Paul Davis disagrees, and I have yet to come up with a scenario where
> read_ptr can be assigned a value larger than size.  And I'm the one
> who pointed out the bug in the first place.

If the amount read or written exactly fills the buffer, then a read or write
pointer equal to the last entry plus one will be stored momentarily, before
the correct (masked) value wraps it back to the beginning of the buffer.  If
the other thread looks at it in that state, I believe data will be
copied outside
the bounds of the buffer, which is bad.

Note that this is a bug even in the uni-processor case.  SMP merely makes
it more likely.
-- 
 joq
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-19 Thread Paul Davis
On Sat, 2008-10-18 at 23:24 -0500, Jack O'Quin wrote:

> If the amount read or written exactly fills the buffer, then a read or write
> pointer equal to the last entry plus one will be stored momentarily, before
> the correct (masked) value wraps it back to the beginning of the buffer.  If
> the other thread looks at it in that state, I believe data will be
> copied outside
> the bounds of the buffer, which is bad.

this is indeed the case. for read space, if the write ptr is "ahead" of
the read ptr, we return write_ptr - read_ptr, which will return a value
that is too large if "write_ptr" gets snapshotted after the increment
and before the mask. 

the same is true for writing. if the read ptr is ahead of the write ptr,
we return the difference between them minus one (the extra -1 stops the
write ptr catching up with the read ptr, a condition that indicates that
the buffer is empty). this value will also be too large if the read_ptr
is used in its "intermediate" state.

i am deeply embarrassed but also puzzled. i remember worrying about
precisely this issue on and over across a period of several years, and
yet satisfying myself that there actually wasn't a problem. i think that
my mistake was as follows: if you look at the read space case, but focus
on whether the *read* ptr is too large at that time, then the safety of
the ringbuffer is still guaranteed. ditto, in the write space case, the
write ptr being too large doesn't violate the safety guarantee. but this
is backwards - the thread that computes write space is the writing
thread, which is the only one that modifies write_ptr. ergo, it never
sees write_ptr as volatile. ditto, the reading thread never sees
read_ptr as volatile. the problems arise because of the *other* pointer,
where the temporary intermediate state *does* cause the computations to
be in error.

i don't know whether to shoot myself or eat another couple of the
oft-promised hats.

so the next question is how best to prevent it. as far as i can see we
have a couple of proposals:

   1) fons' design, which never actually wraps readptr or writeptr, but
  masks the address used to access the data buffer

   2) removing the intermediate state's visibility

i admit to preferring (2) even though i know that with a 64 bit index,
not wrapping the ptrs is not really a problem.

however, it is not totally clear to me how to prevent an optimizing
compiler from doing the wrong thing here. unlike the claims made by
someone involved with portaudio, i believe that it is correct to declare
the read_ptr and write_ptr volatile, so that the compiler knows that it
cannot try to be clever about it accesses of the "other" ptr (i.e.
read_ptr for the writing thread, and vice versa). maybe the comment on
the use of volatile was based on some idea that i thought it made the
variables thread safe, which is and never was the case.

i suspect that the safest code looks like:

size_t tmp = (ptr + incr) & mask;
barrier(); 
ptr = tmp;

but i am not sure whether barrier needs to be read, write or both.

i think that the simpler code:

   ptr = (ptr + incr) & mask;

is subject to potential compiler and/or processor "optimization" that
might reduce it back to the problem case of two ops without an
intermediate load/store location. the volatile declaration ought to
prevent the compiler from doing this, and i don't see why a processor
would do this, ever, but clearly i've already been deeply wrong about
this. does anybody know for certain?

--p



___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-19 Thread Paul Davis
On Sun, 2008-10-19 at 09:55 +0200, Paul Davis wrote:

> i am deeply embarrassed but also puzzled.

maybe i don't need to eat any hats except for the "paying insufficient
attention" one. the jack ringbuffer code was ported from my original C++
implementation, which specifically does not have the problem of visible
intermediate state (and, btw, does force the use of atomic loads &
stores).

--p


___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-19 Thread Fons Adriaensen
On Sun, Oct 19, 2008 at 09:55:54AM +0200, Paul Davis wrote:

>1) fons' design, which never actually wraps readptr or writeptr, but
>   masks the address used to access the data buffer
> 
>2) removing the intermediate state's visibility
> 
> i admit to preferring (2) even though i know that with a 64 bit index,
> not wrapping the ptrs is not really a problem.

Another advantage of 1 is that it has no problem with
a completely full buffer, and this doesn't even require
any special code. If access is always in 2^N blocks,
it can store one block more.

> i think that the simpler code:
> 
>ptr = (ptr + incr) & mask;
> 
> is subject to potential compiler and/or processor "optimization" that
> might reduce it back to the problem case of two ops without an
> intermediate load/store location.

You mean _with_ and intermediate store (I hope) ?

Adding a intermediate store would not really be an
'optimisation'... I doubt if any compiler would do
it. But you can prevent it from doing so, see below.

> the volatile declaration ought to  prevent the compiler from
> doing this

It does. Declaring a var volatile means that the compiler MUST
assume that reading or writing it can have side effects, and
that therefore it MUST access the var exactly as specified. For
the new code this means that only the correct value will ever
be written, for the old it forced the compiler to write the
wrong one.

Ciao,

-- 
FA

Laboratorio di Acustica ed Elettroacustica
Parma, Italia

Lascia la spina, cogli la rosa.

___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-19 Thread Jack O'Quin
On Sun, Oct 19, 2008 at 2:55 AM, Paul Davis <[EMAIL PROTECTED]> wrote:

> i don't know whether to shoot myself or eat another couple of the
> oft-promised hats.

Don't beat yourself up too badly.  Multiple threads accessing shared memory
is *very* tricky.  Even smart people (like you) get it wrong sometimes.

> so the next question is how best to prevent it. as far as i can see we
> have a couple of proposals:
>
>   1) fons' design, which never actually wraps readptr or writeptr, but
>  masks the address used to access the data buffer
>
>   2) removing the intermediate state's visibility
>
> i admit to preferring (2) even though i know that with a 64 bit index,
> not wrapping the ptrs is not really a problem.

I also prefer (2).

> however, it is not totally clear to me how to prevent an optimizing
> compiler from doing the wrong thing here. unlike the claims made by
> someone involved with portaudio, i believe that it is correct to declare
> the read_ptr and write_ptr volatile, so that the compiler knows that it
> cannot try to be clever about it accesses of the "other" ptr (i.e.
> read_ptr for the writing thread, and vice versa). maybe the comment on
> the use of volatile was based on some idea that i thought it made the
> variables thread safe, which is and never was the case.
>
> i suspect that the safest code looks like:
>
>size_t tmp = (ptr + incr) & mask;
>barrier();
>ptr = tmp;
>
> but i am not sure whether barrier needs to be read, write or both.
>
> i think that the simpler code:
>
>   ptr = (ptr + incr) & mask;
>
> is subject to potential compiler and/or processor "optimization" that
> might reduce it back to the problem case of two ops without an
> intermediate load/store location. the volatile declaration ought to
> prevent the compiler from doing this, and i don't see why a processor
> would do this, ever, but clearly i've already been deeply wrong about
> this. does anybody know for certain?

It is best to avoid assumptions about what some future compiler may
consider an "optimization".  If the register pressure is high at some
point in the program, it may decide to store some value just to free up
register space for other variables.

The "volatile" declaration should remove any need for the compiler
barrier() statement, AFAICT.  Note that barrier() is a compiler
directive, and has no effect on the CPU's ability to reorder cache
operations in an SMP memory hierarchy.

Here is a fairly clear and complete description of the memory barrier
issues:  http://lxr.linux.no/linux/Documentation/memory-barriers.txt

As best I can tell, both ring buffer threads require "general" memory
barriers, because they both read and write (different) shared data.  In
Linux kernel terms, they'd need to use smp_mb(), since the problems
they address are multiprocessor-only.  But, since JACK is not compiled
differently for UP and SMP, the full mb() seems more appropriate.

That stuff makes my head hurt.
-- 
 joq
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-19 Thread Paul Coccoli
On Sun, Oct 19, 2008 at 12:24 AM, Jack O'Quin <[EMAIL PROTECTED]> wrote:
> On Sat, Oct 18, 2008 at 10:45 PM, Paul Coccoli <[EMAIL PROTECTED]> wrote:
>> On Sat, Oct 18, 2008 at 11:29 PM, Jack O'Quin <[EMAIL PROTECTED]> wrote:
>>> This is wrong.  For the single reader, single writer case, atomic operations
>>> are *not* necessary.  The bug, as was already pointed out, is due to storing
>>
>> Let's agree to disagree, then.  Single-reader, single-writer does not
>> automatically make something SMP safe.  There is large body of
>> literature on lock-free data structures that agrees with me; someone
>> posted a link to a collection of those earlier in the thread.
>
> Let's not.  This is not just a matter of opinion.  If you read that 
> literature,
> you will find that the ring buffer *is* safe for the single reader,
> single writer
> case.  In many other SMP situations, atomic operations *are* required,
> but not for ring buffers.

The only time you can get away without atomic ops is on uni-processor.
 Please cite a reference that says otherwise.

Notice that all the fixes proposed all involve removing the "+=" and
using only assignment.
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-19 Thread Fons Adriaensen
On Sun, Oct 19, 2008 at 12:36:55PM -0400, Paul Coccoli wrote:

> The only time you can get away without atomic ops is on uni-processor.
>  Please cite a reference that says otherwise.

Plaese show us how using a non-atomic addition could go wrong.

> Notice that all the fixes proposed all involve removing the "+=" and
> using only assignment.

First, that is not the same as requiring atomic addition.
Second, it is not true. The one I proposed just does the '+='.
If it's wrong show us how.

Ciao,

-- 
FA

Laboratorio di Acustica ed Elettroacustica
Parma, Italia

Lascia la spina, cogli la rosa.

___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-19 Thread Stephen Sinclair
On Sun, Oct 19, 2008 at 1:08 PM, Fons Adriaensen <[EMAIL PROTECTED]> wrote:
> On Sun, Oct 19, 2008 at 12:36:55PM -0400, Paul Coccoli wrote:
>
>> The only time you can get away without atomic ops is on uni-processor.
>>  Please cite a reference that says otherwise.
>
> Plaese show us how using a non-atomic addition could go wrong.

On a side note, does anyone know what the performance penalty is (if
any) for using atomic ops?
And does it scale according to number of CPU cores?  What other
factors are there?  I assume the caching architecture makes a big
difference.

I've been using the atomic-ops library from HP for doing various
things lately, and I find it quite nice.  But I have found myself
wondering whether I am paying some kind of penalty.  Of course, I'm
sure it's less than the penalty for using locks.


Steve
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-20 Thread Olivier Guilyardi
Jack O'Quin wrote:
> On Sun, Oct 19, 2008 at 2:55 AM, Paul Davis <[EMAIL PROTECTED]> wrote:
> 
>> i don't know whether to shoot myself or eat another couple of the
>> oft-promised hats.
> 
> Don't beat yourself up too badly.  Multiple threads accessing shared memory
> is *very* tricky.  Even smart people (like you) get it wrong sometimes.

I agree with Jack. I remember a friend of mine once qualifying multi-threading
of "kernel quantization". This is complex but also challenging :)

Although your understanding of these matters is far beyond mine, I think that I
have a good professional experience of coding methodology. And to me the problem
here is indeed methodological: there is a lack of unit tests in JACK.

>> so the next question is how best to prevent it. as far as i can see we
>> have a couple of proposals:
>>
>>   1) fons' design, which never actually wraps readptr or writeptr, but
>>  masks the address used to access the data buffer
>>
>>   2) removing the intermediate state's visibility
>>
>> i admit to preferring (2) even though i know that with a 64 bit index,
>> not wrapping the ptrs is not really a problem.
> 
> I also prefer (2).
> 
>> however, it is not totally clear to me how to prevent an optimizing
>> compiler from doing the wrong thing here. unlike the claims made by
>> someone involved with portaudio, i believe that it is correct to declare
>> the read_ptr and write_ptr volatile, so that the compiler knows that it
>> cannot try to be clever about it accesses of the "other" ptr (i.e.
>> read_ptr for the writing thread, and vice versa). maybe the comment on
>> the use of volatile was based on some idea that i thought it made the
>> variables thread safe, which is and never was the case.
>>
>> i suspect that the safest code looks like:
>>
>>size_t tmp = (ptr + incr) & mask;
>>barrier();
>>ptr = tmp;
>>
>> but i am not sure whether barrier needs to be read, write or both.
>>
>> i think that the simpler code:
>>
>>   ptr = (ptr + incr) & mask;
>>
>> is subject to potential compiler and/or processor "optimization" that
>> might reduce it back to the problem case of two ops without an
>> intermediate load/store location. the volatile declaration ought to
>> prevent the compiler from doing this, and i don't see why a processor
>> would do this, ever, but clearly i've already been deeply wrong about
>> this. does anybody know for certain?
> 
> It is best to avoid assumptions about what some future compiler may
> consider an "optimization".  If the register pressure is high at some
> point in the program, it may decide to store some value just to free up
> register space for other variables.
> 
> The "volatile" declaration should remove any need for the compiler
> barrier() statement, AFAICT.  Note that barrier() is a compiler
> directive, and has no effect on the CPU's ability to reorder cache
> operations in an SMP memory hierarchy.
> 
> Here is a fairly clear and complete description of the memory barrier
> issues:  http://lxr.linux.no/linux/Documentation/memory-barriers.txt
> 
> As best I can tell, both ring buffer threads require "general" memory
> barriers, because they both read and write (different) shared data.  In
> Linux kernel terms, they'd need to use smp_mb(), since the problems
> they address are multiprocessor-only.  But, since JACK is not compiled
> differently for UP and SMP, the full mb() seems more appropriate.
> 
> That stuff makes my head hurt.

Maybe that is because your approach is too theoretical. I've read many theories
about whether memory barriers are needed or not in lock-free ring buffers, and I
think it might lead nowhere without experimenting, that is: write test cases.

Until now, I just couldn't write a test that fail because of the lack of memory
barrier on x86. However, I think I found another bug in Jack ringbuffer, by
writing another test.

It's a bit of a weird test, I call it "bit circle": I start 16 "node" threads.
Each node :
- communicate with its adjacent node through 2 ring buffers
- is responsible for shifting an integer by one bit
- send the shifted result to the next node through a ring buffer
- checks that the value read from previous node is correct.

On my Debian Quad Core and Mac OS X Core Duo boxes, this test fails with Jack's
ringbuffer even with my patch applied. But it succeeds with Portaudio (both with
and without memory barriers), and also with Fons' implementation.

I wish someone could eventually run my test suite on a PowerPC, especially SMP.

The usage is unchanged:
svn co http://svn.samalyse.com/misc/rbtest
cd rbtest
make test

The run time is now shorter, about 2 minutes. Below is the output. You can see
that jack (test-bit-circle-jack) fails the bit circle test, even when patched
(test-bit-circle-jack-fix1). "|" is printed when a node thread starts, and "-"
when a node has read 1000 values, to ensure data is really flowing.

The test-int-array-* tests are the same as my original test. All *-lfq tests 

Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-20 Thread Sean Bolton
Hi Olivier,

On Oct 20, 2008, at 6:13 AM, Olivier Guilyardi wrote:
> Until now, I just couldn't write a test that fail because of the  
> lack of memory
> barrier on x86. However, I think I found another bug in Jack  
> ringbuffer, by
> writing another test.
>
> It's a bit of a weird test, I call it "bit circle": I start 16  
> "node" threads.
> Each node :
> - communicate with its adjacent node through 2 ring buffers
> - is responsible for shifting an integer by one bit
> - send the shifted result to the next node through a ring buffer
> - checks that the value read from previous node is correct.
>
> On my Debian Quad Core and Mac OS X Core Duo boxes, this test fails  
> with Jack's
> ringbuffer even with my patch applied. But it succeeds with  
> Portaudio (both with
> and without memory barriers), and also with Fons' implementation.
>
> I wish someone could eventually run my test suite on a PowerPC,  
> especially SMP.
>
> The usage is unchanged:
> svn co http://svn.samalyse.com/misc/rbtest
> cd rbtest
> make test
>
> The run time is now shorter, about 2 minutes. Below is the output.  
> You can see
> that jack (test-bit-circle-jack) fails the bit circle test, even  
> when patched
> (test-bit-circle-jack-fix1). "|" is printed when a node thread  
> starts, and "-"
> when a node has read 1000 values, to ensure data is really flowing.
>
> The test-int-array-* tests are the same as my original test. All *- 
> lfq tests use
> Fons's Lfq ringbuffer (modified to use char instead of uint32_t) as  
> backend.
>
> ./run-tests.sh bit-circle 512 jack jack-fix1 portaudio portaudio- 
> nobarrier lfq
>
> test-bit-circle-jack - starting (5s max) - buffer size: 512
> || FAILURE: 2 != 514

I thought it suspicious that 514 = 2 + 2 * 256, while on a PPC it  
returns
FAILURE: 2 != 0.  Endianness bug? No, but it's a clue...

I don't think you're seeing a *bug* in the jack ringbuffer, just a  
difference in
semantics between it and the PortAudio and lfq imlementations.  Both PA
and lfq mask the index after reading it, so they can completely fill  
the buffer.
The jack implementation masks the stored index, which means it can't
ever completely fill the buffer, or there would be an ambiguity as to  
whether
it was full or empty.  The most it can store is (size - 1) bytes.

In the jack version of your code, at some point, the buffer almost  
fills, and
jack_ringbuffer_write only writes a single byte to the buffer (low  
order on x86,
high order on PPC).  Next time around, it does this again.  Then the  
following
thread reads, not 0x0002 like it should, but either 0x0202 (x86) or  
0x (PPC).

Paul said he prefers storing the masked pointer, but I think Fons & PA's
approach is better: it allows for full use of the buffer.

-Sean

___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-20 Thread Pieter Palmers
Sean Bolton wrote:
> Hi Olivier,
> 
> On Oct 20, 2008, at 6:13 AM, Olivier Guilyardi wrote:
>> Until now, I just couldn't write a test that fail because of the  
>> lack of memory
>> barrier on x86. However, I think I found another bug in Jack  
>> ringbuffer, by
>> writing another test.
>>
>> It's a bit of a weird test, I call it "bit circle": I start 16  
>> "node" threads.
>> Each node :
>> - communicate with its adjacent node through 2 ring buffers
>> - is responsible for shifting an integer by one bit
>> - send the shifted result to the next node through a ring buffer
>> - checks that the value read from previous node is correct.
>>
>> On my Debian Quad Core and Mac OS X Core Duo boxes, this test fails  
>> with Jack's
>> ringbuffer even with my patch applied. But it succeeds with  
>> Portaudio (both with
>> and without memory barriers), and also with Fons' implementation.
>>
>> I wish someone could eventually run my test suite on a PowerPC,  
>> especially SMP.
>>
>> The usage is unchanged:
>> svn co http://svn.samalyse.com/misc/rbtest
>> cd rbtest
>> make test
>>
>> The run time is now shorter, about 2 minutes. Below is the output.  
>> You can see
>> that jack (test-bit-circle-jack) fails the bit circle test, even  
>> when patched
>> (test-bit-circle-jack-fix1). "|" is printed when a node thread  
>> starts, and "-"
>> when a node has read 1000 values, to ensure data is really flowing.
>>
>> The test-int-array-* tests are the same as my original test. All *- 
>> lfq tests use
>> Fons's Lfq ringbuffer (modified to use char instead of uint32_t) as  
>> backend.
>>
>> ./run-tests.sh bit-circle 512 jack jack-fix1 portaudio portaudio- 
>> nobarrier lfq
>>
>> test-bit-circle-jack - starting (5s max) - buffer size: 512
>> || FAILURE: 2 != 514
> 
> I thought it suspicious that 514 = 2 + 2 * 256, while on a PPC it  
> returns
> FAILURE: 2 != 0.  Endianness bug? No, but it's a clue...
> 
> I don't think you're seeing a *bug* in the jack ringbuffer, just a  
> difference in
> semantics between it and the PortAudio and lfq imlementations.  Both PA
> and lfq mask the index after reading it, so they can completely fill  
> the buffer.
> The jack implementation masks the stored index, which means it can't
> ever completely fill the buffer, or there would be an ambiguity as to  
> whether
> it was full or empty.  The most it can store is (size - 1) bytes.
> 
> In the jack version of your code, at some point, the buffer almost  
> fills, and
> jack_ringbuffer_write only writes a single byte to the buffer (low  
> order on x86,
> high order on PPC).  Next time around, it does this again.  Then the  
> following
> thread reads, not 0x0002 like it should, but either 0x0202 (x86) or  
> 0x (PPC).
> 

I can confirm this, since the following will make the test 'succeed', 
although it results in a very slow progress once the buffers are full.

Index: test-bit-circle.c
===
--- test-bit-circle.c   (revision 314)
+++ test-bit-circle.c   (working copy)
@@ -50,6 +50,9 @@

  if (value)
  {
+  while(jack_ringbuffer_write_space(rb[1]) < 2) {
+usleep(1000);
+  }
jack_ringbuffer_write (rb[1], (char *) &value, sizeof (uint16_t));
value = 0;
  }

Pieter
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-20 Thread Olivier Guilyardi
Hi,

Pieter Palmers a écrit :
> Sean Bolton wrote:
>>
>>> test-bit-circle-jack - starting (5s max) - buffer size: 512
>>> || FAILURE: 2 != 514
>>
>> I thought it suspicious that 514 = 2 + 2 * 256, while on a PPC it  
>> returns
>> FAILURE: 2 != 0.  Endianness bug? No, but it's a clue...

Do you have a PPC? Which one? Can you please paste the whole output of "make 
test"?

>> I don't think you're seeing a *bug* in the jack ringbuffer, just a  
>> difference in
>> semantics between it and the PortAudio and lfq imlementations.  Both PA
>> and lfq mask the index after reading it, so they can completely fill  
>> the buffer.
>> The jack implementation masks the stored index, which means it can't
>> ever completely fill the buffer, or there would be an ambiguity as to  
>> whether
>> it was full or empty.  The most it can store is (size - 1) bytes.
>>
>> In the jack version of your code, at some point, the buffer almost  
>> fills, and
>> jack_ringbuffer_write only writes a single byte to the buffer (low  
>> order on x86,
>> high order on PPC).  Next time around, it does this again.  Then the  
>> following
>> thread reads, not 0x0002 like it should, but either 0x0202 (x86) or  
>> 0x (PPC).
>>
> 
> I can confirm this, since the following will make the test 'succeed', 
> although it results in a very slow progress once the buffers are full.
> 
> Index: test-bit-circle.c
> ===
> --- test-bit-circle.c   (revision 314)
> +++ test-bit-circle.c   (working copy)
> @@ -50,6 +50,9 @@
> 
>  if (value)
>  {
> +  while(jack_ringbuffer_write_space(rb[1]) < 2) {
> +usleep(1000);
> +  }
>jack_ringbuffer_write (rb[1], (char *) &value, sizeof (uint16_t));
>value = 0;
>  }

I know my code performs much more write than read operations, I did that on 
purpose, while playing with it.

Now, back to reality, let's say that you have a thread that synthetizes some 
sound, puts it into a ringbuffer, for an audio thread to read and play it. 
Let's 
also say that the synth thread performs much faster than the audio thread, so 
that the buffer is almost always full. I think this is quite realistic.

In these conditions, if I understand correctly the result of this test, there 
is 
a chance for the output of the ringbuffer to differ from its input. Isn't that 
some sort of bug?

-- 
   Olivier Guilyardi / Samalyse
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-20 Thread Olivier Guilyardi
Olivier Guilyardi a écrit :
> Hi,
> 
> Pieter Palmers a écrit :
>> Sean Bolton wrote:
>>>
 test-bit-circle-jack - starting (5s max) - buffer size: 512
 || FAILURE: 2 != 514
>>>
>>> I thought it suspicious that 514 = 2 + 2 * 256, while on a PPC it  
>>> returns
>>> FAILURE: 2 != 0.  Endianness bug? No, but it's a clue...
> 
> Do you have a PPC? Which one? Can you please paste the whole output of 
> "make test"?
> 
>>> I don't think you're seeing a *bug* in the jack ringbuffer, just a  
>>> difference in
>>> semantics between it and the PortAudio and lfq imlementations.  Both PA
>>> and lfq mask the index after reading it, so they can completely fill  
>>> the buffer.
>>> The jack implementation masks the stored index, which means it can't
>>> ever completely fill the buffer, or there would be an ambiguity as 
>>> to  whether
>>> it was full or empty.  The most it can store is (size - 1) bytes.
>>>
>>> In the jack version of your code, at some point, the buffer almost  
>>> fills, and
>>> jack_ringbuffer_write only writes a single byte to the buffer (low  
>>> order on x86,
>>> high order on PPC).  Next time around, it does this again.  Then the  
>>> following
>>> thread reads, not 0x0002 like it should, but either 0x0202 (x86) or  
>>> 0x (PPC).
>>>
>>
>> I can confirm this, since the following will make the test 'succeed', 
>> although it results in a very slow progress once the buffers are full.
>>
>> Index: test-bit-circle.c
>> ===
>> --- test-bit-circle.c   (revision 314)
>> +++ test-bit-circle.c   (working copy)
>> @@ -50,6 +50,9 @@
>>
>>  if (value)
>>  {
>> +  while(jack_ringbuffer_write_space(rb[1]) < 2) {
>> +usleep(1000);
>> +  }
>>jack_ringbuffer_write (rb[1], (char *) &value, sizeof (uint16_t));
>>value = 0;
>>  }
> 
> I know my code performs much more write than read operations, I did that 
> on purpose, while playing with it.
> 
> Now, back to reality, let's say that you have a thread that synthetizes 
> some sound, puts it into a ringbuffer, for an audio thread to read and 
> play it. Let's also say that the synth thread performs much faster than 
> the audio thread, so that the buffer is almost always full. I think this 
> is quite realistic.
> 
> In these conditions, if I understand correctly the result of this test, 
> there is a chance for the output of the ringbuffer to differ from its 
> input. Isn't that some sort of bug?
> 

Hmm, wait, I think I understand your point.. With the Jack implementation, in a 
buffer of 16 bytes, you can really put 15 bytes. I didn't notice this. And 
since 
I do not check if there's enough space to write my 16-bits integer, when the 
buffer is almost full, it only writes one byte...

So, well, yes it looks like this is no bug... Sorry about confusion.

I've fixed this (r315). This test now indeed passes on all backends.

What about the test-int-array-* family of tests on PPC?

-- 
   Olivier Guilyardi / Samalys
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-22 Thread Sean Bolton
On Oct 20, 2008, at 3:18 PM, Olivier Guilyardi wrote:
> What about the test-int-array-* family of tests on PPC?

They all pass on my uniprocessor G4, but that doesn't
really tell us much.  Anybody got a multiprocessor PPC?

This stuff makes my head hurt, too, but as best I can
figure out, I believe the jack code needs memory
barriers on several non-x86 architectures with weak
memory ordering, PPC probably being the most
common.  It appears to me that without barriers,
there's a small chance reordered writes on one
processor could result in another processor reading
invalid data:

CPU1: CPU2:
write index   ---
write data2   read index
write data3   read data1 <- invalid
write data1   read data2
---   read data3



x
Paul E. McKenney

Memory Ordering in Modern Microprocessors, Part I
http://www.linuxjournal.com/article/8211

Memory Ordering in Modern Microprocessors, Part II
http://www.linuxjournal.com/article/8212

Linus Torvalds
Re: Memory barriers and spin_unlock safety
http://lkml.org/lkml/2006/3/3/142

Parts of this *thread*
Re: Re: [liblf-dev] Possible lock-free ring buffer: msg#00152

http://techweb.rfa.org/pipermail/portaudio/2006-May/005661.html



___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-22 Thread Olivier Guilyardi
Sean Bolton wrote:
> On Oct 20, 2008, at 3:18 PM, Olivier Guilyardi wrote:
>> What about the test-int-array-* family of tests on PPC?
> 
> They all pass on my uniprocessor G4, but that doesn't
> really tell us much.  Anybody got a multiprocessor PPC?

I think it might be worth to try the following on your G4:

- run the test *much* longer (change the sleep call at the end of
test-int-array.c). Having experimented a lot with my tests, I know that failures
appear after a truly random amount of time.

- induce entropy, move your mouse, compose music, make a coffee, kick your
computer... ;-)

- experiment with compiler optimization (-O1, -O2 or -O3, ... in the Makefile's
CFLAGS). I'm unsure about this, but it's been mentioned many times on this
thread and I think it's worth trying.

PS: I'm CC-ing this message to jack-devel as Joern (a great community engineer I
know ;) recommended.

-- 
  Olivier Guilyardi / Samalyse
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-27 Thread Jesse Chappell
For the records, here is the output from a dual IBM Cell system
(QS-21), each PowerPC unit is exposed with two cores.

jlc

$ cat /proc/cpuinfo

processor   : 0
cpu : Cell Broadband Engine, altivec supported
clock   : 3200.00MHz
revision: 5.1 (pvr 0070 0501)

processor   : 1
cpu : Cell Broadband Engine, altivec supported
clock   : 3200.00MHz
revision: 5.1 (pvr 0070 0501)

processor   : 2
cpu : Cell Broadband Engine, altivec supported
clock   : 3200.00MHz
revision: 5.1 (pvr 0070 0501)

processor   : 3
cpu : Cell Broadband Engine, altivec supported
clock   : 3200.00MHz
revision: 5.1 (pvr 0070 0501)

timebase: 14318000
platform: Cell
machine : CHRP MCS,-n/a



=
./run-tests.sh bit-circle 512 jack jack-fix1 portaudio portaudio-nobarrier lfq

test-bit-circle-jack - starting (5s max) - buffer size: 512
||-||--- SUCCESS

test-bit-circle-jack-fix1 - starting (5s max) - buffer size: 512
---- SUCCESS

test-bit-circle-portaudio - starting (5s max) - buffer size: 512
||-|||-||--| SUCCESS

test-bit-circle-portaudio-nobarrier - starting (5s max) - buffer size: 512
-|-|-||-||--||-- SUCCESS

test-bit-circle-lfq - starting (5s max) - buffer size: 512
||-|||-||-|- SUCCESS

./run-tests.sh int-array 16 jack jack-fix1 portaudio portaudio-nobarrier lfq

test-int-array-jack - starting (120s max) - array/buffer size: 8/16
[reader started] [writer started] [flowing] 29364 != 29360 at offset 0
FAILURE in chunk 702810

test-int-array-jack-fix1 - starting (120s max) - array/buffer size: 8/16
[reader started] [writer started] [flowing] SUCCESS

test-int-array-portaudio - starting (120s max) - array/buffer size: 8/16
[reader started] [writer started] [flowing] SUCCESS

test-int-array-portaudio-nobarrier - starting (120s max) -
array/buffer size: 8/16
[reader started] [writer started] [flowing] SUCCESS

test-int-array-lfq - starting (120s max) - array/buffer size: 8/16
[reader started] [writer started] [flowing] SUCCESS
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] [LAU] Simple, easy multithreaded circular buffer library for Linux?

2008-10-27 Thread Jesse Chappell
On Mon, Oct 27, 2008 at 4:38 PM, Jesse Chappell <[EMAIL PROTECTED]> wrote:
> For the records, here is the output from a dual IBM Cell system
> (QS-21), each PowerPC unit is exposed with two cores.

This is actually a QS-20, sorry.

jlc
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev