We have a Spark job whose pipeline (call it Proc-1) produces a result
data frame, say DF-1. From DF-1 we need to create two or more data
frames, say DF-2 and DF-3, via additional SQL or ML processes, i.e.
Proc-2 and Proc-3. Ideally, we would like to run Proc-2 and Proc-3 in
parallel: they can be executed independently, since DF-1 is immutable
once produced and DF-2 and DF-3 do not depend on each other.
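
For concreteness, here is a stripped-down sketch of the shape of our
pipeline. The toy data and the proc2/proc3 functions are made-up
stand-ins for our real SQL/ML steps:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("df-fanout").getOrCreate()
    import spark.implicits._

    // Proc-1 stand-in: our real pipeline ends here with DF-1.
    val df1: DataFrame = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "tag")
    df1.cache()  // materialize DF-1 once so both downstream procs reuse it

    // Proc-2 / Proc-3 stand-ins for the real SQL / ML steps.
    def proc2(df: DataFrame): DataFrame = df.groupBy("tag").count()
    def proc3(df: DataFrame): DataFrame = df.filter($"id" > 1)

    val df2 = proc2(df1)
    val df3 = proc3(df1)
    df2.show()
    df3.show()  // as written, these two actions run one after the other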
Does Spark have any built-in APIs for spawning such sub-jobs within a
single session? If multi-threading is needed, what are the common best
practices? For example, is something along the lines of the sketch
below considered reasonable?
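
Continuing the sketch above, this is the kind of thing we were
considering: wrapping each downstream action in a Scala Future so the
driver submits both jobs concurrently. (The parquet output paths and
the use of the global ExecutionContext are placeholders for
illustration, not a claim about best practice.)

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Each Future submits an independent action from its own driver-side
    // thread; Spark's scheduler is thread-safe for concurrent job
    // submission, so the two jobs can overlap if the cluster has capacity.
    val f2 = Future { proc2(df1).write.mode("overwrite").parquet("/tmp/df2") }
    val f3 = Future { proc3(df1).write.mode("overwrite").parquet("/tmp/df3") }

    // Block until both downstream jobs finish.
    Await.result(Future.sequence(Seq(f2, f3)), Duration.Inf)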
Thanks in advance for your help!
-- ND