Hi,
I am looking for someone who has built a Spark Streaming job that connects to
RabbitMQ.
There is a lot of documentation on how to make a connection with the Java API
(like here: https://www.rabbitmq.com/api-guide.html#connecting), but I am
looking for a recent working example for Spark Streaming.
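A minimal sketch of one way to do this, assuming the RabbitMQ Java client 5.x
and Spark Streaming's custom Receiver API; the host and queue names are
placeholders, not a tested, production-ready integration:

  import com.rabbitmq.client.{CancelCallback, Connection, ConnectionFactory, DeliverCallback}
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  // Hypothetical receiver bridging the RabbitMQ Java client into a DStream.
  class RabbitMQReceiver(host: String, queue: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    @volatile private var connection: Connection = _

    override def onStart(): Unit = {
      // Consume on a separate thread so onStart() returns promptly.
      new Thread("rabbitmq-receiver") {
        override def run(): Unit = receive()
      }.start()
    }

    override def onStop(): Unit = {
      if (connection != null) connection.close()
    }

    private def receive(): Unit = {
      val factory = new ConnectionFactory()
      factory.setHost(host)
      connection = factory.newConnection()
      val channel = connection.createChannel()
      // Push each delivered message body into Spark via store().
      val deliver: DeliverCallback = (_, delivery) =>
        store(new String(delivery.getBody, "UTF-8"))
      val cancel: CancelCallback = _ => restart("Consumer cancelled")
      channel.basicConsume(queue, true /* autoAck */, deliver, cancel)
    }
  }

A stream could then be created with something like
ssc.receiverStream(new RabbitMQReceiver("localhost", "events")).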
Hi,
I tried using the (typed) Dataset API about three years ago. At the time
there were limitations with predicate pushdown, serialization overhead,
and probably other things I've forgotten. Ultimately we chose the
DataFrame API as the sweet spot.
Does anyone know of a good overview of the current state of t
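For context, a small illustration of the pushdown limitation mentioned above
(the path and schema are made up): a typed filter is an opaque lambda to
Catalyst and runs after deserializing every row, while the equivalent Column
expression can be pushed down to the source.

  import org.apache.spark.sql.SparkSession

  case class Event(id: Long, ts: java.sql.Timestamp)

  val spark = SparkSession.builder().appName("ds-vs-df").getOrCreate()
  import spark.implicits._

  // Hypothetical input path.
  val ds = spark.read.parquet("/data/events").as[Event]

  // Typed Dataset API: the lambda is opaque to Catalyst, so the predicate
  // is not pushed down to the Parquet reader.
  val typed = ds.filter(e => e.id > 100)

  // DataFrame-style Column expression: Catalyst sees the predicate and
  // can push it down to the source.
  val untyped = ds.filter($"id" > 100)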
Hello Ali, I've implemented a dry run in my data pipeline using a schema
repository. My pipeline takes a "dataset descriptor", a JSON document
describing the dataset you want to build, loads some "entities", applies
some transformations, and then writes the final dataset.
Is in the "dataset descrip
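A rough sketch of how such a descriptor-driven build might look; every field
and transform name here is invented for illustration, not the poster's actual
schema:

  import org.apache.spark.sql.{DataFrame, SparkSession}

  // Hypothetical shape of the "dataset descriptor".
  case class DatasetDescriptor(
      entities: Seq[String],         // input paths to load
      transformations: Seq[String],  // named transforms, applied in order
      outputPath: String,
      dryRun: Boolean = false)

  def buildDataset(spark: SparkSession,
                   desc: DatasetDescriptor,
                   transforms: Map[String, DataFrame => DataFrame]): Unit = {
    // Load the requested entities (assumed here to share a schema).
    val input = desc.entities
      .map(path => spark.read.parquet(path))
      .reduce(_ unionByName _)
    // Apply the configured transformations in order.
    val result = desc.transformations.foldLeft(input)((df, t) => transforms(t)(df))
    if (desc.dryRun) result.explain()           // dry run: validate the plan only
    else result.write.parquet(desc.outputPath)  // real run: materialize the dataset
  }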
Hello Sean,
Thank you for the heads-up!
The Interaction transform won't help for my use case, as it returns a vector
that I won't be able to hash.
I will definitely dig further into custom transformations though.
Thanks!
David
On Fri, Oct 1, 2021 at 15:49, Sean Owen wrote:
> Are you looking for
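One possible shape for the custom-transformation route David mentions
(illustrative only, not the thread's actual solution; the column names are
assumptions): cross the columns as a string first, then hash that, instead
of using Interaction's vector output.

  import org.apache.spark.ml.feature.FeatureHasher
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{col, concat_ws}

  // Hypothetical: build the pairwise interaction as a string column,
  // then hash it as a categorical feature.
  def hashedCross(df: DataFrame, a: String, b: String): DataFrame = {
    val crossCol = s"${a}_x_$b"
    val withCross = df.withColumn(crossCol, concat_ws("_", col(a), col(b)))
    new FeatureHasher()
      .setInputCols(Array(crossCol))
      .setOutputCol("features")
      .setNumFeatures(1 << 18)  // size of the hash space
      .transform(withCross)
  }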
Hey Ramiro,
Thank you for your detailed answer.
We also have a similar framework which does the same, and I saw very good
results. However, pipelines written as normal Spark apps require changes to
adapt to a framework, and that requires a lot of effort. This is why I'm
suggesting adding it to Spark core to
Hi Spark users,
Can anyone please share their views on the topic?
Regards
Amit Joshi
On Sunday, October 3, 2021, Amit Joshi wrote:
> Hi Spark-Users,
>
> Hope you are doing good.
>
> I have been working on cases where a DataFrame is joined with more than
> one DataFrame separately, on differe
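The quoted question is cut off, but a minimal sketch of one DataFrame joined
separately with two others, on different keys, might look like this (all
table and column names are assumptions):

  import org.apache.spark.sql.DataFrame

  def enrichments(events: DataFrame, users: DataFrame, items: DataFrame)
      : (DataFrame, DataFrame) = {
    // Since `events` is reused across several joins, caching it avoids
    // recomputing it for each one.
    events.cache()
    val byUser = events.join(users, Seq("user_id"), "left")
    val byItem = events.join(items, Seq("item_id"), "left")
    (byUser, byItem)
  }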
Hi everyone,
I'd like to ask how Spark (or, more generally, distributed computing
engines) handles RNGs. At a high level, there are two ways:
1. Use a single RNG on the driver, and random number generation on each worker
makes requests to that single RNG on the driver.
2. Use a separat
The 2nd approach. Spark doesn't work the 1st way in any context: the
driver and executor processes do not cooperate during execution.
Operations on the executors will generally calculate and store a seed once,
and use that in their RNGs, to make the computation reproducible.
On Mon, Oct 4, 2021 at 2:
"Operations on the executor will generally calculate and store a seed once"
Can you elaborate on this? Does Spark try to seed RNGs to ensure the overall
quality of random number generation? To give an extreme example, if all
workers use the same seed, then RNGs repeat the same numbers on each wo
No, it isn't making up new PRNGs. For some function that needs randomness
(e.g. sampling), a few things are important: it has to be done independently
within each task, it shouldn't be the same (almost surely) across tasks, and
it needs to be reproducible. You'll find if you look in the source code that
operati
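A sketch of the pattern described above (the sampling logic is illustrative;
Spark's built-in samplers do essentially this internally): fix one base seed
up front, then derive a distinct, reproducible RNG per task from it.

  import org.apache.spark.rdd.RDD
  import scala.util.Random

  def samplePerPartition(rdd: RDD[Double], baseSeed: Long,
                         fraction: Double): RDD[Double] =
    rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
      // The seed differs per partition (not the same across tasks) yet is
      // a pure function of (baseSeed, partition), so reruns of the task
      // produce the same sample: reproducible.
      val rng = new Random(baseSeed + partitionIndex)
      iter.filter(_ => rng.nextDouble() < fraction)
    }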