Hi Rachana,
> Should I go backward and use Spark Streaming DStream based.
No. Never. It's no longer supported (and should really be removed from the
codebase once and for all - dreaming...).
Spark focuses on Spark SQL and Spark Structured Streaming as the user-facing
modules for batch and streaming.
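For context, the recommended path above is roughly this shape of job. This is a minimal sketch of my own, not code from the thread; the broker address, topic, bucket, and checkpoint path are placeholders, and running it requires a Spark cluster with the spark-sql-kafka connector on the classpath:

```python
# Minimal Structured Streaming sketch: Kafka in, S3 out (placeholder names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
)

query = (
    events.writeStream
    .format("json")  # built-in file sink: exactly-once, maintains _spark_metadata
    .option("path", "s3a://my-bucket/events/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
    .start()
)
query.awaitTermination()
```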
Structured Streaming vs Spark Streaming (DStream)?
Which is recommended for system stability? Exactly-once is NOT the first priority.
The first priority is a STABLE system.
I need to make a decision soon. I need help. Here is the question again:
should I go backward and use Spark Streaming (DStream)?
Frankly speaking, I do not care about EXACTLY ONCE. I am OK with AT LEAST ONCE,
as long as the system does not fail every 5 to 7 days with no recovery option.
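If at-least-once really is acceptable, one commonly discussed workaround (my sketch, not advice given in the thread) is to replace the built-in file sink with `foreachBatch` and the batch writer: Spark still checkpoints Kafka offsets, but no `_spark_metadata` log is kept, so a retried micro-batch may write duplicate files. Paths and names below are placeholders:

```python
# Hedged sketch: at-least-once delivery via foreachBatch instead of the
# built-in file sink, avoiding the ever-growing _spark_metadata log.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("at-least-once-kafka-to-s3").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
)

def write_batch(batch_df, batch_id):
    # Plain batch write: no _spark_metadata log. If a micro-batch is
    # retried after a failure, its files may be written twice.
    batch_df.write.mode("append").json("s3a://my-bucket/events/")

query = (
    events.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
    .start()
)
query.awaitTermination()
```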
On Wednesday, June 17, 2020, 02:31:50 PM PDT, Rachana Srivastava wrote:
Thanks so much TD. Thanks for forwarding your datalake project but at this
time we have budget constraints we can only use open source project.
I just want the Structured Streaming application or the Spark Streaming DStream
application to run without an issue for a long time. I do not want Kafka-connect
(https://docs.confluent.io/current/connect/index.html), though it may be an
easier solution for this use case of just dumping Kafka topics.
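For completeness, the Kafka Connect route mentioned above is mostly configuration rather than code. A sketch of an S3 sink connector config follows; the connector class and option names match Confluent's S3 sink connector, but the bucket, topic, region, and sizing values are placeholders I chose for illustration:

```json
{
  "name": "events-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-west-2",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "tasks.max": "2"
  }
}
```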
On 17/06/2020 18:02, Jungtaek Lim wrote:
Just in case if anyone prefers ASF projects then there are other
alternative projects in ASF as well, alphabetically, Apache Hudi [1] and
Apache Iceberg [2]. Both recently graduated to top-level projects.
(DISCLAIMER: I'm not involved in either.)
BTW it would be nice if we make the metadata
Hello Rachana,
Getting exactly-once semantics on files and making it scale to a very large
number of files are very hard problems to solve. While Structured Streaming
+ built-in file sink solves the exactly-once guarantee that DStreams could
not, it is definitely limited in other ways (scaling in
I have written a simple Spark Structured Streaming app to move data from Kafka
to S3. I found that, in order to support the exactly-once guarantee, Spark creates a
_spark_metadata folder, which ends up growing too large as the streaming app is
SUPPOSED TO run FOREVER. But when the streaming app runs for a
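For anyone hitting the same wall, a small script can at least measure how large the sink's `_spark_metadata` log has grown. This is my own illustrative helper, not something from the thread; it only assumes the directory contains one log file per micro-batch plus periodic `*.compact` files, which matches Spark's on-disk layout for the file sink:

```python
# Hedged sketch: summarize the size of a file sink's _spark_metadata log.
from pathlib import Path

def metadata_stats(sink_path: str) -> dict:
    """Return file count, total bytes, and compact-file count of
    <sink_path>/_spark_metadata (all zeros if the log does not exist)."""
    log_dir = Path(sink_path) / "_spark_metadata"
    files = [p for p in log_dir.iterdir() if p.is_file()] if log_dir.is_dir() else []
    return {
        "files": len(files),
        "bytes": sum(p.stat().st_size for p in files),
        "compact_files": sum(1 for p in files if p.name.endswith(".compact")),
    }
```

Running this periodically against the sink path makes the growth that the thread complains about visible before it takes the job down.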