Hi,
I am looking for someone who has built a Spark Streaming job that connects to
RabbitMQ.
There is a lot of documentation on how to make a connection with the Java API
(like here: https://www.rabbitmq.com/api-guide.html#connecting), but I am
looking for a recent working example for Spark Streaming.
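A minimal sketch of one way to do this, assuming the RabbitMQ Java client 5.x
and Spark Streaming's custom Receiver API; the host and queue names are
placeholders, not a tested, production-ready integration:

  import com.rabbitmq.client.{CancelCallback, Connection, ConnectionFactory, DeliverCallback}
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  // Hypothetical receiver bridging the RabbitMQ Java client into a DStream.
  class RabbitMQReceiver(host: String, queue: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    @volatile private var connection: Connection = _

    override def onStart(): Unit = {
      // Consume on a separate thread so onStart() returns promptly.
      new Thread("rabbitmq-receiver") {
        override def run(): Unit = receive()
      }.start()
    }

    override def onStop(): Unit = {
      if (connection != null) connection.close()
    }

    private def receive(): Unit = {
      val factory = new ConnectionFactory()
      factory.setHost(host)
      connection = factory.newConnection()
      val channel = connection.createChannel()
      // Push each delivered message body into Spark via store().
      val deliver: DeliverCallback = (_, delivery) =>
        store(new String(delivery.getBody, "UTF-8"))
      val cancel: CancelCallback = _ => restart("Consumer cancelled")
      channel.basicConsume(queue, true /* autoAck */, deliver, cancel)
    }
  }

A stream could then be created with something like
ssc.receiverStream(new RabbitMQReceiver("localhost", "events")).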
Hi,
I tried using the (typed) Dataset API about three years ago. At the time
there were limitations with predicate pushdown, serialization overhead,
and probably other things I've forgotten. Ultimately we chose the
DataFrame API as the sweet spot.
Does anyone know of a good overview of the current state of t
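For context, a small illustration of the pushdown limitation mentioned above
(the path and schema are made up): a typed filter is an opaque lambda to
Catalyst and runs after deserializing every row, while the equivalent Column
expression can be pushed down to the source.

  import org.apache.spark.sql.SparkSession

  case class Event(id: Long, ts: java.sql.Timestamp)

  val spark = SparkSession.builder().appName("ds-vs-df").getOrCreate()
  import spark.implicits._

  // Hypothetical input path.
  val ds = spark.read.parquet("/data/events").as[Event]

  // Typed Dataset API: the lambda is opaque to Catalyst, so the predicate
  // is not pushed down to the Parquet reader.
  val typed = ds.filter(e => e.id > 100)

  // DataFrame-style Column expression: Catalyst sees the predicate and
  // can push it down to the source.
  val untyped = ds.filter($"id" > 100)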
Hello Ali, I've implemented a dry run in my data pipeline using a schema
repository. My pipeline takes a "dataset descriptor", a JSON document
describing the dataset you want to build, loads some "entities", applies
some transformations, and then writes the final dataset.
Is in the "dataset descrip
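A rough sketch of how such a descriptor-driven build might look; every field
and transform name here is invented for illustration, not the poster's actual
schema:

  import org.apache.spark.sql.{DataFrame, SparkSession}

  // Hypothetical shape of the "dataset descriptor".
  case class DatasetDescriptor(
      entities: Seq[String],         // input paths to load
      transformations: Seq[String],  // named transforms, applied in order
      outputPath: String,
      dryRun: Boolean = false)

  def buildDataset(spark: SparkSession,
                   desc: DatasetDescriptor,
                   transforms: Map[String, DataFrame => DataFrame]): Unit = {
    // Load the requested entities (assumed here to share a schema).
    val input = desc.entities
      .map(path => spark.read.parquet(path))
      .reduce(_ unionByName _)
    // Apply the configured transformations in order.
    val result = desc.transformations.foldLeft(input)((df, t) => transforms(t)(df))
    if (desc.dryRun) result.explain()           // dry run: validate the plan only
    else result.write.parquet(desc.outputPath)  // real run: materialize the dataset
  }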
Hello Sean,
Thank you for the heads-up!
The Interaction transform won't help for my use case, as it returns a vector
that I won't be able to hash.
I will definitely dig further into custom transformations though.
Thanks!
David
On Fri, Oct 1, 2021 at 15:49, Sean Owen wrote:
> Are you looking for
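One possible shape for the custom-transformation route David mentions
(illustrative only, not the thread's actual solution; the column names are
assumptions): cross the columns as a string first, then hash that, instead
of using Interaction's vector output.

  import org.apache.spark.ml.feature.FeatureHasher
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{col, concat_ws}

  // Hypothetical: build the pairwise interaction as a string column,
  // then hash it as a categorical feature.
  def hashedCross(df: DataFrame, a: String, b: String): DataFrame = {
    val crossCol = s"${a}_x_$b"
    val withCross = df.withColumn(crossCol, concat_ws("_", col(a), col(b)))
    new FeatureHasher()
      .setInputCols(Array(crossCol))
      .setOutputCol("features")
      .setNumFeatures(1 << 18)  // size of the hash space
      .transform(withCross)
  }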
Hey Ramiro,
Thank you for your detailed answer.
We also have a similar framework which does the same, and I saw very good
results. However, pipelines written as normal Spark apps require changes to
adapt to a framework, and that requires a lot of effort. This is why I'm
suggesting adding it to Spark core to
Hi Spark users,
Can anyone please share their views on the topic?
Regards
Amit Joshi
On Sunday, October 3, 2021, Amit Joshi wrote:
> Hi Spark-Users,
>
> Hope you are doing good.
>
> I have been working on cases where a DataFrame is joined with more than
> one DataFrame separately, on differe
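The quoted question is cut off, but a minimal sketch of one DataFrame joined
separately with two others, on different keys, might look like this (all
table and column names are assumptions):

  import org.apache.spark.sql.DataFrame

  def enrichments(events: DataFrame, users: DataFrame, items: DataFrame)
      : (DataFrame, DataFrame) = {
    // Since `events` is reused across several joins, caching it avoids
    // recomputing it for each one.
    events.cache()
    val byUser = events.join(users, Seq("user_id"), "left")
    val byItem = events.join(items, Seq("item_id"), "left")
    (byUser, byItem)
  }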
Hi everyone,
I'd like to ask how Spark (or, more generally, distributed computing
engines) handles RNGs. At a high level, there are two ways:
1. Use a single RNG on the driver, and random number generation on each worker
makes requests to that single RNG on the driver.
2. Use a separat
The 2nd approach. Spark doesn't work the 1st way in any context: the
driver and executor processes do not cooperate during execution.
Operations on the executors will generally calculate and store a seed once,
and use that in their RNGs, to make the computation reproducible.
On Mon, Oct 4, 2021 at 2:
"Operations on the executor will generally calculate and store a seed once"
Can you elaborate on this? Does Spark try to seed RNGs to ensure the overall
quality of random number generation? To give an extreme example, if all
workers use the same seed, then RNGs repeat the same numbers on each wo
No, it isn't making up new PRNGs. For some function that needs randomness
(e.g. sampling), a few things are important: it has to be done independently
within each task, it shouldn't be the same (almost surely) across tasks, and
it needs to be reproducible. You'll find if you look in the source code that
operati
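A sketch of the pattern described above (the sampling logic is illustrative;
Spark's built-in samplers do essentially this internally): fix one base seed
up front, then derive a distinct, reproducible RNG per task from it.

  import org.apache.spark.rdd.RDD
  import scala.util.Random

  def samplePerPartition(rdd: RDD[Double], baseSeed: Long,
                         fraction: Double): RDD[Double] =
    rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
      // The seed differs per partition (not the same across tasks) yet is
      // a pure function of (baseSeed, partition), so reruns of the task
      // produce the same sample: reproducible.
      val rng = new Random(baseSeed + partitionIndex)
      iter.filter(_ => rng.nextDouble() < fraction)
    }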