Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-18 Thread Jacek Laskowski
Hi Rachana, > Should I go backward and use Spark Streaming DStream based. No. Never. It's no longer supported (and should really be removed from the codebase once and for all - dreaming...). Spark focuses on Spark SQL and Spark Structured Streaming as user-facing modules for batch and streaming
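
A minimal sketch (not code from the thread) of the unified Dataset/DataFrame API that Spark SQL and Structured Streaming expose for batch and streaming; all paths and names below are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object UnifiedApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unified-api-sketch").getOrCreate()
    val schema = StructType(Seq(
      StructField("user_id", StringType),
      StructField("event", StringType)))

    // Batch: one-shot read and aggregation over the data.
    spark.read.schema(schema).json("s3a://my-bucket/input/")
      .groupBy("user_id").count()
      .write.mode("overwrite").parquet("s3a://my-bucket/batch-counts/")

    // Streaming: the same transformations, executed incrementally.
    spark.readStream.schema(schema).json("s3a://my-bucket/input/")
      .groupBy("user_id").count()
      .writeStream
      .outputMode("complete")
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/unified") // placeholder
      .start()
      .awaitTermination()
  }
}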

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Structured Streaming vs Spark Streaming (DStream)? Which is recommended for system stability? Exactly once is NOT the first priority. The first priority is a STABLE system. I need to make a decision soon. I need help. Here is the question again. Should I go backward and use Spark Streaming DStream

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Frankly speaking I do not care about EXACTLY ONCE... I am OK with AT LEAST ONCE as long as the system does not fail every 5 to 7 days with no recovery option. On Wednesday, June 17, 2020, 02:31:50 PM PDT, Rachana Srivastava wrote: Thanks so much TD.  Thanks for forwarding your datalake
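
One workaround that is often suggested for exactly this trade-off (not quoted from this thread): write each micro-batch with foreachBatch and the plain batch writer, which skips the streaming file sink's _spark_metadata log entirely; the cost is at-least-once delivery (duplicates are possible on retries) rather than exactly-once. A hedged sketch, with broker, topic, and S3 paths as placeholders:

import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreach-batch-to-s3").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                     // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    // Plain batch write per micro-batch: no _spark_metadata directory,
    // but a retried batch can be written twice (at-least-once).
    val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
      batch.write.mode("append").parquet("s3a://my-bucket/events/") // placeholder path

    events.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/") // placeholder
      .start()
      .awaitTermination()
  }
}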

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Thanks so much TD.  Thanks for forwarding your datalake project, but at this time we have budget constraints and can only use open source projects.  I just want the Structured Streaming application or Spark Streaming DStream application to run without an issue for a long time.  I do not want

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Breno Arosa
Kafka Connect (https://docs.confluent.io/current/connect/index.html) may be an easier solution for this use case of just dumping Kafka topics. On 17/06/2020 18:02, Jungtaek Lim wrote: Just in case anyone prefers ASF projects, there are other alternative projects in the ASF as well,
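
For reference, a hedged sketch of what such a connector configuration can look like, assuming Confluent's S3 sink connector plugin is installed (it is a separate plugin, not part of Apache Kafka itself); connector name, topic, bucket, and region are placeholders:

# S3 sink connector config sketch (placeholders throughout)
name=events-s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=2
topics=events
s3.bucket.name=my-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=1000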

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Jungtaek Lim
Just in case anyone prefers ASF projects, there are other alternative projects in the ASF as well, alphabetically: Apache Hudi [1] and Apache Iceberg [2]. Both recently graduated as top-level projects. (DISCLAIMER: I'm not involved in either.) BTW it would be nice if we make the metadata
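
For illustration only, a hedged sketch of writing a Structured Streaming query into an Apache Iceberg table instead of the plain file sink (Iceberg tracks table state in its own metadata and snapshots rather than a _spark_metadata log). It assumes the iceberg-spark-runtime jar is on the classpath and a catalog and table with the names below already exist; all names are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object IcebergStreamingWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("iceberg-streaming-write").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                     // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    events.writeStream
      .format("iceberg")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/iceberg-events/") // placeholder
      .toTable("my_catalog.db.events") // placeholder table; toTable requires Spark 3.1+
      .awaitTermination()
  }
}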

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Tathagata Das
Hello Rachana, Getting exactly-once semantics on files and making it scale to a very large number of files are very hard problems to solve. While Structured Streaming + the built-in file sink provides the exactly-once guarantee that DStreams could not, it is definitely limited in other ways (scaling in

Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
I have written a simple Spark Structured Streaming app to move data from Kafka to S3. Found that in order to support the exactly-once guarantee Spark creates a _spark_metadata folder, which ends up growing too large as the streaming app is SUPPOSED TO run FOREVER. But when the streaming app runs for a
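
A minimal sketch of the kind of job described above (not the poster's actual code): reading from Kafka and writing Parquet to S3 with the built-in file sink, which is the combination that maintains the _spark_metadata log inside the output directory; broker, topic, and paths are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaToS3FileSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                     // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

    // Built-in file sink: exactly-once via the _spark_metadata log kept under
    // the output path, which grows with every committed micro-batch.
    events.writeStream
      .format("parquet")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .option("path", "s3a://my-bucket/events/")                            // placeholder output
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")  // placeholder
      .start()
      .awaitTermination()
  }
}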