Thanks Mich. I understand now how to deal with multiple streams in a
single job, but the responses I got before were very abstract and
confusing, so I had to go back to the Spark docs and figure out the
details. This is what I found out:
1. The standard and recommended way to do multi-stream processing in
   Structured Streaming (not DStream) is to use the join operations.
   There is no need to use collections and mapping functions (I guess
   those apply to DStream). Once you have created the combined/joined
   DF, you use that DF's stream writer to process each micro-batch or
   event before writing results to the output sink (the official Spark
   doc on this isn't very clear and the code examples are incomplete:
   https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations)
2. The semantics of multi-stream processing are really mind-boggling.
   You have to clearly define inner vs. outer joins along with
   watermark conditions. We are still in the process of detailing our
   use cases, since we need to process three streams simultaneously in
   near real time and we don't want any blocking situations (i.e. one
   stream source not producing data for a considerable period of
   time). If you or anyone else has any suggestions, I'd appreciate
   your comments.
Thanks! -- ND
On 8/26/21 2:38 AM, Mich Talebzadeh wrote:
Hi ND,
Within the same Spark job you can handle two topics
simultaneously with SSS. Is that what you are implying?
HTH
view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
Disclaimer: Use it at your own risk. Any and all responsibility for
any loss, damage or destruction of data or any other property which
may arise from relying on this email's technical content is explicitly
disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
On Tue, 24 Aug 2021 at 23:37, Artemis User <arte...@dtechspace.com
<mailto:arte...@dtechspace.com>> wrote:
Is there a way to run multiple streams in a single Spark job using
Structured Streaming? If not, is there an easy way to perform
inter-job communications (e.g. referencing a dataframe among
concurrent jobs) in Spark? Thanks a lot in advance!
-- ND
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
<mailto:user-unsubscr...@spark.apache.org>