Re: [RNG]: How does Spark handle RNGs?

2021-10-04 Thread Sean Owen
No, it isn't making up new PRNGs. For a function that needs randomness (e.g. sampling), a few things are important: it has to be done independently within each task, the results shouldn't (almost surely) be the same across tasks, and it needs to be reproducible. You'll find if you look in the source code that
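
A minimal sketch of that pattern (my own illustration, not Spark's internal code): derive each task's seed from a fixed base seed plus the partition index, so the streams are independent across tasks, (almost surely) different per task, and reproducible for a given base seed.

    import org.apache.spark.sql.SparkSession
    import scala.util.Random

    object SeedPerTaskSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.master("local[*]").getOrCreate()
        val baseSeed = 42L
        // Each partition derives its own seed from the base seed and its
        // partition index: different streams per task, same result on re-run.
        val sampled = spark.sparkContext.parallelize(1 to 1000, 4)
          .mapPartitionsWithIndex { (idx, it) =>
            val rng = new Random(baseSeed + idx)
            it.filter(_ => rng.nextDouble() < 0.1)
          }
        println(sampled.count())
        spark.stop()
      }
    }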

Re: [RNG]: How does Spark handle RNGs?

2021-10-04 Thread Benjamin Du
"Operations on the executor will generally calculate and store a seed once" Can you elaborate more this? Does Spark try to seed RNGs to ensure overall quality of random number generating? To give an extremely example, if all workers use the same seed, then RNGs repeat the same numbers on each

Re: [RNG]: How does Spark handle RNGs?

2021-10-04 Thread Sean Owen
The 2nd approach. Spark doesn't work the 1st way in any context: the driver and executor processes do not cooperate during execution. Operations on the executor will generally calculate and store a seed once, and use that in RNGs, to make the computation reproducible. On Mon, Oct 4, 2021 at
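
As far as I can tell, this behaviour is visible from the public API with rand(); a small demonstration (the column name is mine):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.rand

    object ReproducibleRand {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.master("local[*]").getOrCreate()
        // With an explicit seed, the seed is fixed once when the plan is
        // built; each task then derives its stream from it, so re-running
        // the same plan over the same partitioning repeats the numbers.
        val df = spark.range(0, 10, 1, 2).withColumn("r", rand(7L))
        df.show()
        df.show()  // same values as the first show()
        spark.stop()
      }
    }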

[RNG]: How does Spark handle RNGs?

2021-10-04 Thread Benjamin Du
Hi everyone, I'd like to ask how Spark (or, more generally, distributed computing engines) handles RNGs. At a high level, there are two ways: 1. Use a single RNG on the driver, and have random number generation on each worker make requests to that single RNG on the driver. 2. Use a

Re: [Spark] Optimize spark join on different keys for same data frame

2021-10-04 Thread Amit Joshi
Hi spark users, Can anyone please share any views on the topic? Regards Amit Joshi On Sunday, October 3, 2021, Amit Joshi wrote: > Hi Spark-Users, > > Hope you are doing good. > > I have been working on cases where a dataframe is joined with more than > one data frame separately, on
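
For later readers, a minimal sketch of the scenario as I read it (table and column names are made up): the same dataframe is joined twice on different keys, so a shuffle keyed on one join key does not line up with the other join.

    import org.apache.spark.sql.SparkSession

    object MultiKeyJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.master("local[*]").getOrCreate()
        import spark.implicits._

        val facts = Seq((1, 20, "a"), (2, 30, "b")).toDF("k1", "k2", "v")
        val dim1  = Seq((1, "x")).toDF("k1", "d1")
        val dim2  = Seq((20, "y")).toDF("k2", "d2")

        // Two separate joins on different keys: repartitioning `facts`
        // by k1 would help the first join but not the second, so each
        // join can trigger its own shuffle of `facts`.
        facts.join(dim1, Seq("k1")).show()
        facts.join(dim2, Seq("k2")).show()
        spark.stop()
      }
    }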

Re: [Spark-Core] Spark Dry Run

2021-10-04 Thread Ali Behjati
Hey Ramiro, Thank you for your detailed answer. We also have a similar framework which does the same thing, and I saw very good results. However, pipelines built as normal Spark apps require changes to adapt to a framework, and that takes a lot of effort. This is why I'm suggesting adding it to Spark core

Re: Trying to hash cross features with mllib

2021-10-04 Thread David Diebold
Hello Sean, Thank you for the heads-up! The Interaction transform won't help for my use case, as it returns a vector that I won't be able to hash. I will definitely dig further into custom transformations, though. Thanks! David On Fri, Oct 1, 2021 at 15:49, Sean Owen wrote: > Are you looking
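
One possible shape for such a custom transformation (a sketch under my own assumptions, not necessarily what David built): materialize the cross feature as a string column, then hash it with FeatureHasher, which accepts string columns directly.

    import org.apache.spark.ml.feature.FeatureHasher
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.concat_ws

    object CrossFeatureHashSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq(("US", "mobile"), ("FR", "desktop")).toDF("country", "device")
          // The cross feature is just the concatenated pair of values;
          // hashing the combined string gives one slot per (country, device).
          .withColumn("country_x_device", concat_ws("_", $"country", $"device"))

        val hasher = new FeatureHasher()
          .setInputCols("country_x_device")
          .setOutputCol("features")
          .setNumFeatures(1 << 18)

        hasher.transform(df).show(false)
        spark.stop()
      }
    }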

Re: [Spark-Core] Spark Dry Run

2021-10-04 Thread Ramiro Laso
Hello Ali! I've implemented a dry run in my data pipeline using a schema repository. My pipeline takes a "dataset descriptor", which is a JSON document describing the dataset you want to build, loads some "entities", applies some transformations, and then writes the final dataset. It is in the "dataset
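
A rough sketch of the dry-run idea (the descriptor fields and names here are hypothetical, not Ramiro's actual framework): resolve every entity and build the full plan, but stop before any action, so analysis errors surface without running a job.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical descriptor: which entities to load and where to write.
    case class DatasetDescriptor(entities: Seq[String], output: String)

    object DryRunSketch {
      def dryRun(spark: SparkSession, desc: DatasetDescriptor): Unit = {
        val frames: Seq[DataFrame] = desc.entities.map(spark.table)
        val result = frames.reduce(_ unionByName _)
        // explain() forces analysis and optimization of the plan, but no
        // action runs, so nothing is computed or written.
        result.explain()
      }
    }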

Current state of dataset api

2021-10-04 Thread Magnus Nilsson
Hi, I tried using the (typed) Dataset API about three years ago. At the time there were limitations around predicate pushdown, serialization overhead, and maybe more things I've forgotten. Ultimately we chose the Dataframe API as the sweet spot. Does anyone know of a good overview of the current state of
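
For context on the predicate-pushdown point: a typed filter takes an opaque Scala lambda that Catalyst cannot inspect, while a Column expression can be analyzed and pushed down to the source. A minimal illustration:

    import org.apache.spark.sql.SparkSession

    object TypedVsUntypedFilter {
      case class Rec(x: Long)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = spark.range(1000000).map(l => Rec(l))

        // Opaque lambda: Catalyst sees a black-box function, so the
        // predicate generally cannot be pushed into the data source.
        ds.filter(r => r.x > 5).explain()

        // Column expression: Catalyst understands it and can push it down.
        ds.filter($"x" > 5).explain()
        spark.stop()
      }
    }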

[spark streaming] how to connect to rabbitmq with spark streaming.

2021-10-04 Thread Joris Billen
Hi, I am looking for someone who has made a Spark Streaming job that connects to RabbitMQ. There is a lot of documentation on how to make a connection with the Java API (like here: https://www.rabbitmq.com/api-guide.html#connecting), but I am looking for a recent working example for Spark Streaming
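
Not a tested, production-ready example, but the usual shape for the DStream API is a custom Receiver wrapping the RabbitMQ Java client (the host and queue below are placeholders):

    import com.rabbitmq.client.{CancelCallback, ConnectionFactory, DeliverCallback}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class RabbitMQReceiver(host: String, queue: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Consume on a background thread; store() hands messages to Spark.
        new Thread("rabbitmq-receiver") {
          override def run(): Unit = {
            val factory = new ConnectionFactory()
            factory.setHost(host)
            val channel = factory.newConnection().createChannel()
            val onMessage: DeliverCallback =
              (_, delivery) => store(new String(delivery.getBody, "UTF-8"))
            val onCancel: CancelCallback = _ => ()
            channel.basicConsume(queue, true, onMessage, onCancel)
          }
        }.start()
      }

      def onStop(): Unit = ()  // connection teardown omitted in this sketch
    }

It would plug in via ssc.receiverStream(new RabbitMQReceiver("localhost", "some-queue")) on a StreamingContext.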