Hi,

I have a setup (in mind) where data is written to Kafka and persisted in HDFS (e.g., using Camus), so that I have an all-time archive of all stream data ever received. Now I want to process that entire archive first and, when I am done with that, continue seamlessly with the live stream, using Spark Streaming. (In a perfect world, Kafka would have infinite retention and I would always use the Kafka receiver, starting from offset 0.)

Does anyone have an idea how to realize such a setup? Would I write a custom receiver that first reads the HDFS files and then connects to Kafka? Is there an existing solution for this use case?
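For the record, here is a rough sketch of the two-phase approach I am considering, assuming Spark 1.x with the spark-streaming-kafka artifact; the HDFS path, ZooKeeper host, group id, and the process function are all placeholders, not actual values from my setup. The part I don't know how to solve is the hand-off between the two phases (records arriving in Kafka while phase 1 is still running):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ArchiveThenLive {
  // Placeholder for the actual per-record application logic.
  def process(record: String): Unit = { /* ... */ }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("archive-then-live")
    val sc = new SparkContext(conf)

    // Phase 1: batch-process the all-time archive written by Camus.
    // "hdfs:///camus/topics/mytopic" is an illustrative path.
    val archive = sc.textFile("hdfs:///camus/topics/mytopic/*")
    archive.foreach(record => process(record))

    // Phase 2: once the archive is done, switch to the live Kafka stream,
    // applying the same logic to each record.
    val ssc = new StreamingContext(sc, Seconds(10))
    val stream = KafkaUtils.createStream(
      ssc, "zkhost:2181", "archive-then-live-group", Map("mytopic" -> 1))
    stream.map(_._2).foreachRDD(rdd => rdd.foreach(record => process(record)))

    ssc.start()
    ssc.awaitTermination()
  }
}
```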
Thanks,
Tobias