Re: Multi-threading implementation equivalence in Beam

Lukasz Cwik Mon, 20 Jun 2016 16:10:44 -0700

You are correct, an in memory database is outside the scope/knowledge of
the runner and it will not be able to move any side effects over. For
example, lets say your processing some elements X, Y, Z in one bundle on
one machine and processing Y fails. The bundle may be retried on another
machine where your changes to X may not exist. Or the bundle may be split
such that X and Z is processed on one machine and Y on yet another machine.


If the reason for using an in-memory database is just caching and you can
reload/recreate the cached entries than this should be fine, you'll just
suffer cache misses elsewhere.
If the reason is for caching previously seen elements which you will output
later or write like side effects than this can disappear if the bundle
processing is moved to another machine.

Its not that in memory databases can't be used, they just can't be relied
on persistent state.

On Mon, Jun 20, 2016 at 3:36 PM, amir bahmanyari <[email protected]>
wrote:

> Wonderful. Thanks Lukaz.
> Have one question. The statement in that page "In addition, *your DoFn
> should not rely on any persistent state from invocation to invocation.*".
> I am using in-memory db such as Redis or Aerospike for intermediate look
> ups etc.
> Is this what the above statement referring to: dont use in-memory dbs?
> Thanks again.
>
>
> ------------------------------
> *From:* Lukasz Cwik <[email protected]>
> *To:* [email protected]; amir bahmanyari <[email protected]>
>
> *Sent:* Monday, June 20, 2016 1:08 PM
>
> *Subject:* Re: Multi-threading implementation equivalence in Beam
>
> Threading/parallelism is up to the runner and does not map 1-1 to the java
> memory model since most runners will execute in a distributed manner. In
> general, runners will attempt to break up the work as evenly as possible
> and schedule work across multiple machines / cpu cores at all times to
> maximize the throughput / minimize time for execution of the job This is
> abstracted away much by getting users to write DoFns that apply with ParDo.
> Please take a look at this explanation about ParDo (
> https://cloud.google.com/dataflow/model/par-do) to get a better
> understanding of its usage and as a place to look at some examples.
>
> On Mon, Jun 20, 2016 at 12:44 PM, amir bahmanyari <[email protected]>
> wrote:
>
> Thanks JB.
> I am executing FlinkPipelineRunner...& later will experirnt the same with
> SparkRunner....any examples pls?
> Cheers
>
> ------------------------------
> *From:* Jean-Baptiste Onofré <[email protected]>
> *To:* [email protected]
> *Sent:* Monday, June 20, 2016 12:35 PM
> *Subject:* Re: Multi-threading implementation equivalence in Beam
>
> Hi Amir,
>
> the DirectPipelineRunner uses multi-thread to achieve ParDo execution
> for instance.
>
> You mean example of Beam pipeline code ? Or example of runner ?
>
> Regards
> JB
>
> On 06/20/2016 09:25 PM, amir bahmanyari wrote:
> > Hi Colleagues,
> > Hope you all had a great weekend. Another novice question :-)
> > Is there a pipeline parallelism/threading model provide by Beam that
> > equates the multi-threading model in Java for instance?
> > Any examples if so?
> > Thanks again,
> > Amir
>
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>
>
>
>
>

Re: Multi-threading implementation equivalence in Beam

Reply via email to