Hi all,

I am currently using Spark Streaming to consume and save logs every hour in our production pipeline. The current setup is a crontab job that checks every minute whether the streaming job is still running and, if not, resubmits it. I am using the direct approach for the Kafka consumer. I have two questions:
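For reference, this is roughly how the job is set up (broker addresses, topic name, batch interval, and output path are placeholders, not the real values):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LogConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hourly-log-consumer")
    // Batch interval is a placeholder; the real job saves logs once an hour.
    val ssc = new StreamingContext(conf, Seconds(60))

    // Direct approach: no receiver, no offsets in ZooKeeper, no group id.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("logs")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Persist the message payloads; the output prefix is a placeholder.
    stream.map(_._2).saveAsTextFiles("hdfs:///data/logs/raw")

    ssc.start()
    ssc.awaitTermination()
  }
}
```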
1. With the direct approach, no offsets are stored in ZooKeeper and no group id is specified. Can two consumers (one Spark Streaming job and one Kafka console consumer from the Kafka package) read the same topic from the brokers at the same time, with both of them receiving all messages, i.e. publish-subscribe mode? What about two Spark Streaming jobs reading from the same topic?

2. How can I avoid data loss if a Spark job is killed? Does checkpointing serve this purpose? The default behavior of Spark Streaming is to read only the latest logs, but if a job is killed, can the new job resume from where the old one left off so that no logs are lost? A sketch of the restart logic I have in mind follows below.
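To make question 2 concrete, this is the checkpoint-based restart I am considering; the checkpoint directory, brokers, topic, and output path are placeholders, and I am not sure this actually resumes from the unprocessed offsets after the job is killed:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RecoverableLogConsumer {
  // Placeholder paths and broker list.
  val checkpointDir = "hdfs:///checkpoints/log-consumer"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("hourly-log-consumer")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDir)  // offsets and DAG metadata are written here

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("logs"))
    stream.map(_._2).saveAsTextFiles("hdfs:///data/logs/raw")
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On resubmission, rebuild the context from the checkpoint if one exists,
    // so the new job should continue from the last checkpointed offsets.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Thanks!
Bill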