Re: Coroutines - was Re: Daffodil SAX API Proposal

2020-09-16 Thread Beckerle, Mike
I checked, and the one in the gist I wrote is right. It also has a SAX 
producer-consumer example worked out that wasn't in the original code that was 
part of daffodil.



From: Beckerle, Mike 
Sent: Wednesday, September 16, 2020 4:07 PM
To: dev@daffodil.apache.org 
Subject: Re: Coroutines - was Re: Daffodil SAX API Proposal

Yes there is a reason not to parallelize. It makes profiling to determine where 
time is spent far harder. It makes debugging far harder as it introduces the 
possibility of interactions across threads. Coroutines have no concurrency, so 
there are no race conditions, no possible interactions. It's still ordinary 
code, not "multithreaded" code.
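
For illustration, the handoff a thread-based coroutine library performs can be approximated with a SynchronousQueue, which has no buffer: put() blocks until the other side take()s, so the two stacks never make concurrent progress on the data path. This is a hypothetical sketch, not the actual daffodil-lib Coroutine class:

```java
import java.util.concurrent.SynchronousQueue;

// Two threads, but coroutine-style control handoff: every put() parks the
// producer until the consumer take()s, approximating yield/resume on a JVM
// that has no real stack switching.
public class HandoffDemo {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<Integer> toConsumer = new SynchronousQueue<>();

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 3; i++) {
                    toConsumer.put(i); // "yield": blocks until the value is consumed
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        for (int i = 0; i < 3; i++) {
            System.out.println(toConsumer.take()); // "resume": blocks until a value is offered
        }
        producer.join();
    }
}
```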

The coroutines library is the one I wrote that used to be part of daffodil-lib, 
but we removed it once we no longer needed it.



From: Steve Lawrence 
Sent: Wednesday, September 16, 2020 10:15 AM
To: dev@daffodil.apache.org 
Subject: Re: Coroutines - was Re: Daffodil SAX API Proposal

By "coroutines library", you're talking about the one on your gist that
you wrote?

https://gist.github.com/mbeckerle/312474bac9bee9102438c160890b6539

It would be nice if batching were an option, at least so we could test
it and see if there is a difference.

Also, although we're not trying to go faster by overlapping, perhaps
this is something we might want to consider? Is there a reason to not
parallelize the SAX thread filling up the queue and the unparse thread
reading from that queue? I guess if one thread is much faster than the
other then there's really not much benefit and one thread might just
spin waiting for the other to read/write an event? Does your coroutine
library do something to prevent this from happening?


On 9/16/20 10:00 AM, Beckerle, Mike wrote:
> The point of the coroutines library is that doing something "as simple as" 
> just an array blocking queue, etc. with threads is always problematic.
>
> Also, an important point. The objective here is "no parallelism". We're not 
> trying to go faster by overlapping things. We're just trying to change stacks 
> so we can run two different stack contexts.  Ideally this would all be a 
> single thread with stack switching. JVMs just don't have that.
>
> I think the coroutines library is pretty simple to use, and could be adapted 
> to batch up requests to reduce overhead if we want.
>
>
> 
> From: Steve Lawrence 
> Sent: Wednesday, September 16, 2020 8:12 AM
> To: dev@daffodil.apache.org 
> Subject: Re: Coroutines - was Re: Daffodil SAX API Proposal
>
> As I recall, the libraries that use things like annotations end up
> changing the return types of all the callers, which ends up leaking into
> the API and changing it, so I don't think any of those solutions will
> work.
>
> I think we have to use Threads, where the main thread is the caller
> using the SAX API, and when unparse is called we spawn off the actual
> unparse in a new thread. And there's some data structure shared between
> these threads that contains event information.
>
> I think it really just comes down to which of the various
> implementations to use. I'm not too familiar with Mike's Coroutine
> class. Mike, can you maybe discuss what advantages this has over say
> just spawning a thread and sharing something like an ArrayBlockingQueue
> to pass event information between the threads? This seems like the
> simplest option, and allows tuning the size of the queue, which should
> allow batching of events and minimize context switching between threads.
>
> - Steve
>
> On 9/15/20 10:15 PM, Olabusayo Kilo wrote:
>> I don't think we came to a conclusion on which path we should take. If I
>> understand correctly, our options seem to be between the Thread-based
>> Coroutine library (#3; which has a bit of overhead) and the
>> Continuations library (#2; which is not yet supported for 2.13 and
>> requires the suspendable annotation). I wanted to check in to see if
>> there was a preferred one that I could focus my effort on?
>>
>> On 4/24/20 9:28 AM, Beckerle, Mike wrote:
>>> A further thought on this. The overhead difference between
>>> continuations and threads was 1 to 4 (roughly).
>>>
>>> If you add real workload to what happens on either side of that
>>> producer-consumer relationship, I bet this difference disappears into
>>> the noise, not because it becomes more efficient due to less
>>> contention, but because it's such a tiny fraction of the actual work
>>> being done.
>>>
>>> I have a copy of the Thread-based coroutines library in a separate
>>> sandbox, so if you want it I'll get it over to you so you don't have
>>> to dig for it.
>>> 
>>> From: Beckerle, Mike 
>>> Sent: Friday, April 24, 2020 8:53 AM
>>> To: dev@daffodil.apache.org 
>>> Subject: Re: Coroutines - was Re: Daffodil SAX API Proposal
>>>
>>> That's really informative and confirms intuition that using threads
>>> 

Re: Coroutines - was Re: Daffodil SAX API Proposal

2020-09-16 Thread Steve Lawrence
As I recall, the libraries that use things like annotations end up
changing the return types of all the callers, which ends up leaking into
the API and changing it, so I don't think any of those solutions will
work.

I think we have to use Threads, where the main thread is the caller
using the SAX API, and when unparse is called we spawn off the actual
unparse in a new thread. And there's some data structure shared between
these threads that contains event information.

I think it really just comes down to which of the various
implementations to use. I'm not too familiar with Mike's Coroutine
class. Mike, can you maybe discuss what advantages this has over say
just spawning a thread and sharing something like an ArrayBlockingQueue
to pass event information between the threads? This seems like the
simplest option, and allows tuning the size of the queue, which should
allow batching of events and minimize context switching between threads.
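
A minimal sketch of the queue-based design described above; the event strings and the "EOF" sentinel are invented for illustration, not Daffodil's actual API:

```java
import java.util.concurrent.ArrayBlockingQueue;

// One thread plays the SAX content handler (producer), a second plays the
// unparse call (consumer). The queue capacity is the tuning knob mentioned
// above: larger capacity means fewer blocking handoffs between the threads.
public class QueueDemo {
    static final String END = "EOF"; // sentinel marking end of the event stream

    public static void main(String[] args) throws InterruptedException {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        // "unparse" side: consume events until the sentinel arrives
        Thread consumer = new Thread(() -> {
            try {
                String e;
                while (!(e = queue.take()).equals(END)) {
                    System.out.println(e);
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // SAX content-handler side: produce events, then the sentinel
        queue.put("startElement:root");
        queue.put("characters:hello");
        queue.put("endElement:root");
        queue.put(END);
        consumer.join();
    }
}
```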

- Steve

On 9/15/20 10:15 PM, Olabusayo Kilo wrote:
> I don't think we came to a conclusion on which path we should take. If I
> understand correctly, our options seem to be between the Thread-based
> Coroutine library (#3; which has a bit of overhead) and the
> Continuations library (#2; which is not yet supported for 2.13 and
> requires the suspendable annotation). I wanted to check in to see if
> there was a preferred one that I could focus my effort on?
> 
> On 4/24/20 9:28 AM, Beckerle, Mike wrote:
>> A further thought on this. The overhead difference between
>> continuations and threads was 1 to 4 (roughly).
>>
>> If you add real workload to what happens on either side of that
>> producer-consumer relationship, I bet this difference disappears into
>> the noise, not because it becomes more efficient due to less
>> contention, but because it's such a tiny fraction of the actual work
>> being done.
>>
>> I have a copy of the Thread-based coroutines library in a separate
>> sandbox, so if you want it I'll get it over to you so you don't have
>> to dig for it.
>> 
>> From: Beckerle, Mike 
>> Sent: Friday, April 24, 2020 8:53 AM
>> To: dev@daffodil.apache.org 
>> Subject: Re: Coroutines - was Re: Daffodil SAX API Proposal
>>
>> That's really informative and confirms intuition that using threads
>> really hurts performance when all you need is a stack switch.
>>
>> In this case reducing contention should reduce total work, but that
>> depends on how carefully the queue is implemented. If it is a single
>> lock it may not matter.
>>
>> We actually don't care about going faster through parallelism because we
>> should assume the machine is already saturated with work. We want to
>> reduce the total amount of work done.
>>
>>
>>
>>
>> 
>> From: Steve Lawrence 
>> Sent: Friday, April 24, 2020 8:02:37 AM
>> To: dev@daffodil.apache.org 
>> Subject: Re: Coroutines - was Re: Daffodil SAX API Proposal
>>
>> I decided to look at performance of three potential options to see if
>> that would rule anything out. I looked at 1) coroutines 2) continuations
>> 3) threads with BlockingQueue. For each of these, I modified the gist to
>> remove printlns and use a different producer-consumer model (which is
>> actually very straightforward if we come across other alternatives to
>> test). So everything is the same except for how the SAX content handler
>> interacts with the custom InfosetInputter. For the performance numbers
>> below, I created enough "events" in a loop so that the rate of events
>> remained roughly the same as I increased the number of events.
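
A harness of the kind described above might look roughly like this (the BlockingQueue variant is shown; absolute numbers will differ by machine, only relative comparisons matter):

```java
import java.util.concurrent.ArrayBlockingQueue;

// Push N dummy events through a bounded queue from one thread to another
// and report throughput in millions of events per second.
public class Bench {
    public static void main(String[] args) throws InterruptedException {
        final int n = 1_000_000;
        ArrayBlockingQueue<Integer> q = new ArrayBlockingQueue<>(100);

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) q.take(); // drain all events
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) q.put(i); // produce all events
        consumer.join();
        double secs = (System.nanoTime() - t0) / 1e9;
        System.out.printf("%.2f million events per second%n", n / secs / 1e6);
    }
}
```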
>>
>> 1) coroutines
>>
>> It turns out the coroutines library has a limitation where the
>> yieldval() call must be directly inside the coroutine{} block. This is
>> basically a non-starter for us, since the entire unparse call needs to
>> be a coroutine, and the yieldval call happens way down the stack. So not
>> only does this not have any active development, it functionally won't
>> even work for us.
>>
>> 2) continuations
>>
>> 16.50 million events per second
>>
>> 3) thread with BlockingQueue
>>
>> I think this is similar to the Coroutine library you wrote for Daffodil
>> (though it looks like it's been removed, we can probably find it in git
>> the history if we want). This runs the unparse method in a thread and
>> has a blocking queue that the producer pushes to and the consumer takes
>> from. I tested with different queue sizes to see how that affects
>> performance:
>>
>>     size  rate
>>        1  0.14 million events per second
>>       10  1.36 million events per second
>>      100  3.18 million events per second
>>     1000  3.16 million events per second
>>    10000  3.09 million events per second
>>
>> So this BlockingQueue approach is quite a bit slower, and definitely
>> requires batching events to be somewhat performant. I guess this
>> slowness makes sense as this approach creates a thread for the unparse,
>> has different threads blocking on this queue, and also
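
The batching these numbers call for could be sketched as follows; BatchingProducer is an invented name for illustration, not part of Daffodil:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// Enqueue lists of events instead of single events, so one queue operation
// (and one potential thread wakeup) amortizes over batchSize events.
public class BatchDemo {
    static class BatchingProducer<T> {
        private final ArrayBlockingQueue<List<T>> queue;
        private final int batchSize;
        private List<T> batch = new ArrayList<>();

        BatchingProducer(ArrayBlockingQueue<List<T>> queue, int batchSize) {
            this.queue = queue;
            this.batchSize = batchSize;
        }

        void emit(T e) throws InterruptedException {
            batch.add(e);
            if (batch.size() == batchSize) flush(); // full batch: hand it off
        }

        void flush() throws InterruptedException {
            if (!batch.isEmpty()) {
                queue.put(batch);
                batch = new ArrayList<>();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ArrayBlockingQueue<List<Integer>> q = new ArrayBlockingQueue<>(10);
        BatchingProducer<Integer> p = new BatchingProducer<>(q, 3);
        for (int i = 1; i <= 7; i++) p.emit(i);
        p.flush(); // push the final partial batch
        System.out.println(q.take()); // [1, 2, 3]
        System.out.println(q.take()); // [4, 5, 6]
        System.out.println(q.take()); // [7]
    }
}
```

The consumer side would then iterate each dequeued list, turning one blocking take into batchSize event deliveries.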