Thinking more about it – it should only be 2 tasks, as A and B are most likely collapsed by Spark into a single task.

Again – learn to use the Spark UI, as it's really informative. The combination of DAG visualization and task count should answer most of your questions.

-adrian

From: Adrian Tanase
Date: Monday, October 26, 2015 at 11:57 AM
To: Nipun Arora, Andy Dang
Cc: user
Subject: Re: [SPARK STREAMING] Concurrent operations in spark streaming

If I understand the order correctly, not really. First of all, the easiest way to make sure it works as expected is to check out the visual DAG in the Spark UI. It should map 1:1 to your code, and since I don't see any shuffles in the operations below, it should execute all in one stage, forking after X.

That means that the executor cores will each process a partition completely in isolation, most likely as 3 tasks (A, B, X2) – most likely in the order you define in code, although depending on the data some tasks may get skipped or moved around.

I'm curious – why do you ask? Do you have a particular concern or use case that relies on ordering between A, B and X2?

-adrian

From: Nipun Arora
Date: Sunday, October 25, 2015 at 4:09 PM
To: Andy Dang
Cc: user
Subject: Re: [SPARK STREAMING] Concurrent operations in spark streaming

So essentially the driver/client program needs to explicitly have two threads to ensure concurrency? What happens when the program is sequential, i.e. I execute function A and then function B? Does this mean that each RDD first goes through function A, and then stream X is persisted, but is processed in function B only after the RDD has been processed by A?

Thanks
Nipun

On Sat, Oct 24, 2015 at 5:32 PM Andy Dang <nam...@gmail.com> wrote:

If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the driver, then both of them will be executed in parallel. Whichever gets submitted to Spark first gets executed first - you can use a semaphore if you need to ensure the ordering of execution, though I would assume that the ordering wouldn't matter.
-------
Regards,
Andy

On Sat, Oct 24, 2015 at 10:08 PM, Nipun Arora <nipunarora2...@gmail.com> wrote:

I wanted to understand something about the internals of Spark Streaming executions. If I have a stream X, and in my program I send stream X to function A and function B:

1. In function A, I do a few transform/filter operations etc. on X -> Y -> Z to create stream Z. Then I do a forEach operation on Z and print the output to a file.
2. In function B, I reduce stream X -> X2 (say, the min value of each RDD), and print the output to a file.

Are both functions being executed for each RDD in parallel? How does it work?

Thanks
Nipun
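[Editor's note] The pipeline described in the question, and the sequential behaviour Nipun asks about upthread, can be modeled without Spark at all. In this stripped-down sketch, plain Python lists stand in for the RDDs of each micro-batch and the function names are hypothetical; the point is that with a single driver thread, B's output action on a batch only starts after A's action on that batch has returned:

```python
order = []  # records the sequence in which the output actions fire

def function_a(rdd):
    # X -> Y -> Z: a few transform/filter steps, then a foreach-style output.
    z = [x + 1 for x in rdd if x > 0]
    order.append(("A", z))

def function_b(rdd):
    # X -> X2: reduce the batch to its min value, then output it.
    order.append(("B", min(rdd)))

# Two micro-batches of stream X, processed by a single driver thread.
for rdd in [[3, -1, 2], [7, 4]]:
    function_a(rdd)   # A's action runs to completion first...
    function_b(rdd)   # ...only then is B's action submitted

print(order)
```

Within each batch, "A" always appears before "B" in `order`, which is the sequential behaviour the follow-up question describes.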
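[Editor's note] Andy's two-thread suggestion can be sketched the same way. This is a minimal model, not real Spark code: the two functions stand in for blocking output actions (the foreach in 1, the reduce in 2), and launching each from its own driver thread is what would let Spark schedule both jobs concurrently; a `threading.Semaphore` could gate submission if ordering mattered.

```python
import threading

results = []
lock = threading.Lock()

def action_a(batch):
    # Stands in for function A's foreach/print on stream Z.
    out = [x * 2 for x in batch if x % 2 == 0]
    with lock:
        results.append(("A", out))

def action_b(batch):
    # Stands in for function B's reduce-to-min on X2.
    with lock:
        results.append(("B", min(batch)))

batch = [5, 2, 8, 3]  # one micro-batch of stream X

# Two driver threads: both actions are in flight concurrently,
# so neither blocks waiting for the other to finish.
t1 = threading.Thread(target=action_a, args=(batch,))
t2 = threading.Thread(target=action_b, args=(batch,))
t1.start(); t2.start()
t1.join(); t2.join()

print(sorted(results))  # completion order is not guaranteed, hence the sort
```

Because completion order is nondeterministic, any code that consumes `results` has to tolerate either ordering, which is exactly why a semaphore (or an explicit join) is needed only if one action must observe the other's output.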