[RNG]: How does Spark handle RNGs?

2021-10-04 Thread Benjamin Du
Hi everyone,

I'd like to ask how Spark (or, more generally, distributed computing 
engines) handles RNGs. At a high level, there are two ways:

  1.  Use a single RNG on the driver; each worker requests its random 
numbers from that single RNG.
  2.  Use a separate RNG on each worker.

If the 2nd approach is used, how does Spark seed the RNGs on different 
workers to ensure the overall quality of the random numbers generated?


Best,



Ben Du

Personal Blog | GitHub | Bitbucket | Docker Hub


Re: [RNG]: How does Spark handle RNGs?

2021-10-04 Thread Sean Owen
The 2nd approach. Spark doesn't work in the 1st way in any context: the
driver and executor processes do not cooperate during execution.
Operations on the executors will generally calculate and store a seed once,
and use it to create their RNGs, so that the computation is reproducible.
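To illustrate the "calculate and store a seed once" idea, here is a minimal sketch in plain Python. It is not Spark code: the `SeededSampler` class and its `sample` method are hypothetical names made up for this example, and the standard-library `random.Random` stands in for whatever generator Spark uses internally. It only shows why a stored seed makes a recomputed task produce the same result.

```python
import random


class SeededSampler:
    """Hypothetical sampler that picks a seed once and reuses it on every run."""

    def __init__(self, fraction, seed=None):
        self.fraction = fraction
        # The seed is chosen once, when the operation is defined, and stored.
        if seed is None:
            seed = random.SystemRandom().getrandbits(63)
        self.seed = seed

    def sample(self, records):
        # A fresh RNG is built from the stored seed each time the work runs,
        # so recomputing the same records yields the same sample.
        rng = random.Random(self.seed)
        return [r for r in records if rng.random() < self.fraction]


sampler = SeededSampler(fraction=0.3)
records = list(range(100))

first_run = sampler.sample(records)
retry_run = sampler.sample(records)  # e.g. the same work recomputed after a failure

print(first_run == retry_run)  # True
```

If the seed were drawn afresh inside `sample`, a recomputed run could return a different subset, which is the situation the stored seed is meant to avoid.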



Re: [RNG]: How does Spark handle RNGs?

2021-10-04 Thread Benjamin Du
"Operations on the executor will generally calculate and store a seed once"

Can you elaborate on this? Does Spark try to seed the RNGs so as to ensure the 
overall quality of the random numbers generated? To give an extreme example, if 
all workers use the same seed, then the RNGs repeat the same numbers on each 
worker, which is obviously a poor choice.


Best,



Ben Du

Personal Blog <http://www.legendu.net/> | GitHub <https://github.com/dclong/> | 
Bitbucket <https://bitbucket.org/dclong/> | Docker Hub <https://hub.docker.com/r/dclong/>




Re: [RNG]: How does Spark handle RNGs?

2021-10-04 Thread Sean Owen
No, it isn't making up new PRNGs. For a function that needs randomness
(e.g. sampling), a few things are important: the random draws have to be
made independently within each task, they shouldn't be the same (almost
surely) across tasks, and they need to be reproducible. If you look in the
source code, you'll find that operations like this will generally pick and
store a seed, create an RNG with that seed, and use it locally. Different
tasks would have different seeds.
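A rough sketch of the three properties above (independent per task, different across tasks, reproducible), again in plain Python rather than Spark. The helper names `task_seed` and `run_task` are hypothetical, and deriving each task's seed as base seed plus partition index is an assumption made for illustration, not a statement of what every Spark operation does.

```python
import random


def task_seed(base_seed, partition_index):
    # Assumed scheme for this sketch: offset the stored base seed
    # by the index of the partition the task is processing.
    return base_seed + partition_index


def run_task(base_seed, partition_index, n=5):
    # Each task builds its own local RNG; nothing is requested from the driver.
    rng = random.Random(task_seed(base_seed, partition_index))
    return [rng.random() for _ in range(n)]


base_seed = 42  # picked once and stored with the operation

run_a = [run_task(base_seed, i) for i in range(4)]
run_b = [run_task(base_seed, i) for i in range(4)]  # e.g. a recomputation

# Reproducible: the same base seed and partition index give the same draws.
print(run_a == run_b)

# Different across tasks: count the distinct sequences among the four tasks.
print(len({tuple(draws) for draws in run_a}))

# The failure case from the question: every task using the same seed
# makes every task repeat the same numbers.
same_seed = [run_task(base_seed, 0) for _ in range(4)]
print(len({tuple(draws) for draws in same_seed}))
```

How well separated the streams from nearby seeds are depends on the generator and on how it mixes the seed; this sketch does not address that.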
