Re: Ensuring a task does not get executed concurrently

2023-06-12 Thread Robert Bradshaw via dev
If you absolutely cannot tolerate concurrency, an external locking mechanism
is required. While a distributed system often waits for a work item to fail
before retrying it, this is not always the case (e.g. backup workers may be
scheduled, and whichever finishes first is deemed the successful attempt).
Worse, the master may decide a task has failed (e.g. due to loss of contact)
and re-assign the item to another worker when in fact the original worker has
not fully died (e.g. it simply lost network connectivity, or entered a
bad-but-not-fatal state where a user-code thread continues on). We call these
zombie workers, and though they're uncommon they're nigh impossible to rule
out.
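As one illustration only (a sketch, not Beam-specific): such a lock can be
approximated with a conditional create on an object store, here Google Cloud
Storage. The bucket and lock names are placeholders, and lease expiry /
cleanup is omitted even though zombie workers make it important in practice.

from google.api_core.exceptions import PreconditionFailed
from google.cloud import storage


def try_acquire_lock(bucket_name: str, lock_name: str) -> bool:
    """Attempts to take a global lock by creating a GCS object exactly once."""
    blob = storage.Client().bucket(bucket_name).blob(lock_name)
    try:
        # if_generation_match=0 means "create only if the object does not
        # already exist", so exactly one caller can succeed.
        blob.upload_from_string(b"locked", if_generation_match=0)
        return True
    except PreconditionFailed:
        return False  # another worker already holds the lock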

On Mon, Jun 12, 2023 at 11:36 AM Bruno Volpato via dev 
wrote:

> Hi Stephan,
>
> I am not sure if this is the best way to achieve it, but I've seen
> parallelism limited by using state / KV pairs and limiting the number of
> keys.
> In your case, you could use the same key for both non-concurrency-safe
> operations; when using state, the Beam model guarantees that they aren't
> executed concurrently.
>
> This blog post may be helpful:
> https://beam.apache.org/blog/stateful-processing/
>
> On Mon, Jun 12, 2023 at 2:21 PM Stephan Hoyer via dev 
> wrote:
>
>> Can the Beam data model (specifically the Python SDK) support executing
>> functions that are idempotent but not concurrency-safe?
>>
>> I am thinking of a task like setting up a database (or in my case, a Zarr
>> store in Xarray-Beam) where it is not safe to run setup concurrently, but
>> if the whole operation fails it is safe to retry.
>>
>> I recognize that a better model would be to use entirely atomic
>> operations, but sometimes this can be challenging to guarantee for tools
>> that were not designed with parallel computing in mind.
>>
>> Cheers,
>> Stephan
>>
>


Re: Ensuring a task does not get executed concurrently

2023-06-12 Thread Bruno Volpato via dev
Hi Stephan,

I am not sure if this is the best way to achieve it, but I've seen
parallelism limited by using state / KV pairs and limiting the number of
keys.
In your case, you could use the same key for both non-concurrency-safe
operations; when using state, the Beam model guarantees that they aren't
executed concurrently.

This blog post may be helpful:
https://beam.apache.org/blog/stateful-processing/
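
A minimal sketch of that pattern might look like the following; run_setup is
a placeholder for your non-concurrency-safe call, and the constant key is
what forces the runner to process elements for that key one at a time.

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec


class NonConcurrentSetupFn(beam.DoFn):
    # Per-key state recording whether setup has already run.
    SETUP_DONE = ReadModifyWriteStateSpec('setup_done', VarIntCoder())

    def process(self, element, setup_done=beam.DoFn.StateParam(SETUP_DONE)):
        key, value = element
        if not setup_done.read():
            run_setup(value)  # placeholder for the non-concurrency-safe call
            setup_done.write(1)
        yield value


# Usage sketch: keying everything with one constant key serializes execution.
# with beam.Pipeline() as p:
#     (p
#      | beam.Create(work_items)
#      | beam.Map(lambda x: ('singleton', x))  # one key => no concurrency
#      | beam.ParDo(NonConcurrentSetupFn()))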

On Mon, Jun 12, 2023 at 2:21 PM Stephan Hoyer via dev 
wrote:

> Can the Beam data model (specifically the Python SDK) support executing
> functions that are idempotent but not concurrency-safe?
>
> I am thinking of a task like setting up a database (or in my case, a Zarr
> store in Xarray-Beam) where it is not safe to run setup concurrently, but
> if the whole operation fails it is safe to retry.
>
> I recognize that a better model would be to use entirely atomic
> operations, but sometimes this can be challenging to guarantee for tools
> that were not designed with parallel computing in mind.
>
> Cheers,
> Stephan
>


Ensuring a task does not get executed concurrently

2023-06-12 Thread Stephan Hoyer via dev
Can the Beam data model (specifically the Python SDK) support executing
functions that are idempotent but not concurrency-safe?

I am thinking of a task like setting up a database (or in my case, a Zarr
store in Xarray-Beam) where it is not safe to run setup concurrently, but if
the whole operation fails it is safe to retry.

I recognize that a better model would be to use entirely atomic operations,
but sometimes this can be challenging to guarantee for tools that were not
designed with parallel computing in mind.

Cheers,
Stephan