If I understand the order correctly, not really. First of all, the easiest way
to make sure it works as expected is to check out the visual DAG in the Spark
UI.

It should map 1:1 to your code, and since I don't see any shuffles in the
operations below, it should all execute in one stage, forking after X.
That means that each executor core will process a partition completely in
isolation, most likely as 3 tasks (A, B, X2), and most likely in the order you
define in code, although depending on the data some tasks may get skipped or
moved around.

I’m curious – why do you ask? Do you have a particular concern or use case that 
relies on ordering between A, B and X2?

-adrian

From: Nipun Arora
Date: Sunday, October 25, 2015 at 4:09 PM
To: Andy Dang
Cc: user
Subject: Re: [SPARK STREAMING] Concurrent operations in spark streaming

So essentially the driver/client program needs to explicitly have two threads 
to ensure concurrency?

What happens when the program is sequential, i.e. I execute function A and
then function B? Does this mean that each RDD first goes through function A,
and then stream X is persisted, but is processed in function B only after the
RDD has been processed by A?

Thanks
Nipun
On Sat, Oct 24, 2015 at 5:32 PM Andy Dang <nam...@gmail.com> wrote:
If you execute the collect step (foreach in 1, possibly reduce in 2) in two 
threads in the driver then both of them will be executed in parallel. Whichever 
gets submitted to Spark first gets executed first - you can use a semaphore if 
you need to ensure the ordering of execution, though I would assume that the 
ordering wouldn't matter.
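
To illustrate the two-thread driver pattern Andy describes, here is a minimal plain-Python sketch. The two functions are hypothetical stand-ins for the Spark actions (the foreach in function A, the reduce in function B); in a real driver they would each submit a job to Spark:

```python
import threading

results = {}

def run_function_a():
    # Stand-in for the collect step of function A,
    # e.g. Z.foreachRDD(...) writing output to a file.
    results["A"] = "A done"

def run_function_b():
    # Stand-in for the collect step of function B,
    # e.g. the reduce on X printing the min value.
    results["B"] = "B done"

# Submitting both actions from separate driver threads lets Spark
# schedule them concurrently. If ordering mattered, a
# threading.Semaphore could gate B until A has submitted its job.
t_a = threading.Thread(target=run_function_a)
t_b = threading.Thread(target=run_function_b)
t_a.start()
t_b.start()
t_a.join()
t_b.join()
```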

-------
Regards,
Andy

On Sat, Oct 24, 2015 at 10:08 PM, Nipun Arora <nipunarora2...@gmail.com> wrote:
I wanted to understand something about the internals of Spark Streaming
execution.

If I have a stream X, and in my program I send stream X to function A and 
function B:

1. In function A, I do a few transform/filter operations etc. on X -> Y -> Z to
create stream Z. Then I do a forEach operation on Z and print the output to a
file.

2. Then in function B, I reduce stream X -> X2 (say, the min value of each
RDD), and print the output to a file.
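
To make the data flow concrete, here is a hypothetical plain-Python analogue of the two pipelines, with one micro-batch (RDD) of stream X modeled as a list; the transform and filter in function A are made-up examples:

```python
# One micro-batch of stream X, modeled as a plain list for illustration.
X = [7, 3, 9, 4]

def function_a(x):
    # Hypothetical transform X -> Y, then filter Y -> Z.
    Y = [v * 2 for v in x]
    Z = [v for v in Y if v > 7]
    return Z  # stands in for the forEach step that prints Z to a file

def function_b(x):
    # Reduce X -> X2: the min value of the batch.
    return min(x)

Z = function_a(X)    # [14, 18, 8]
X2 = function_b(X)   # 3
```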

Are both functions being executed for each RDD in parallel? How does it work?

Thanks
Nipun

