Yeah, that general plan should work, but it might be a little awkward for adding TopicPartitions after the fact (i.e. when you have stored offsets for some, but not all, of your TopicPartitions).
Personally I just query Kafka for the starting offsets if they don't exist in the DB, using the methods in KafkaCluster.scala. Yes, if you don't include a starting offset for a particular partition, it will be ignored.

On Thu, Dec 3, 2015 at 3:31 PM, Dan Dutrow <dan.dut...@gmail.com> wrote:

> Hey Cody, I'm convinced that I'm not going to get the functionality I want
> without using the Direct Stream API.
>
> I'm now looking through
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md#exactly-once-using-transactional-writes
> where you say "For the very first time the job is run, the table can be
> pre-loaded with appropriate starting offsets."
>
> Could you provide some guidance on how to determine valid starting offsets
> the very first time, particularly in my case where I have 10+ topics in
> multiple different deployment environments with an unknown and potentially
> dynamic number of partitions per topic per environment?
>
> I'd be happy if I could initialize all consumers to the value of
> *auto.offset.reset = "largest"*, record the partitions and offsets as they
> flow through Spark, and then use those discovered offsets from thereon out.
>
> I'm thinking I can probably just do some if/else logic and use the basic
> createDirectStream and the more advanced
> createDirectStream(...fromOffsets...) if the offsets for my topic name
> exist in the database. Any reason that wouldn't work? If I don't include
> an offset range for a particular partition, will that partition be ignored?
>
> On Wed, Dec 2, 2015 at 3:17 PM Cody Koeninger <c...@koeninger.org> wrote:
>
>> Use the direct stream. You can put multiple topics in a single stream,
>> and differentiate them on a per-partition basis using the offset range.
>>
>> On Wed, Dec 2, 2015 at 2:13 PM, dutrow <dan.dut...@gmail.com> wrote:
>>
>>> I found the JIRA ticket:
>>> https://issues.apache.org/jira/browse/SPARK-2388
>>>
>>> It was marked as invalid.
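The approach described above (use the stored offset where the DB has one, fall back to Kafka's latest offset for any partition it doesn't cover) can be sketched as a pure merge step. This is illustrative only: the keys are plain (topic, partition) tuples standing in for kafka.common.TopicAndPartition, and the "latest" map is assumed to come from something like KafkaCluster.getLatestLeaderOffsets before being passed to createDirectStream's fromOffsets argument.

```scala
// Sketch: merge offsets stored in a DB with the latest offsets Kafka reports,
// so partitions added after the fact still get a valid starting offset.
// Keys here are (topic, partition) tuples; in the real Spark/Kafka API they
// would be kafka.common.TopicAndPartition. Names are illustrative.
object OffsetMerge {
  def mergeStartingOffsets(
      stored: Map[(String, Int), Long],         // offsets previously committed to the DB
      latestFromKafka: Map[(String, Int), Long] // e.g. queried via KafkaCluster methods
  ): Map[(String, Int), Long] = {
    // Every partition Kafka currently knows about gets a starting offset:
    // the stored one if present, otherwise Kafka's latest (which behaves
    // like auto.offset.reset = "largest" for that partition).
    latestFromKafka.map { case (tp, latest) =>
      tp -> stored.getOrElse(tp, latest)
    }
  }
}
```

Because the result covers every current partition, none are silently ignored by the direct stream, and newly added partitions start from the head of the log while existing ones resume from the DB.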