Yeah, that general plan should work, but it might be a little awkward for
adding topic-partitions after the fact (i.e. when you have stored offsets
for some, but not all, of your topic-partitions).

Personally, I just query Kafka for the starting offsets if they don't exist
in the DB, using the methods in KafkaCluster.scala.
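A minimal sketch of that fallback logic, with the Kafka-specific parts abstracted away (the names `resolveStartingOffsets` and `fallbackOffset` are hypothetical, not anything from the Spark or Kafka APIs):

```scala
// Sketch: for every topic-partition, prefer the offset stored in the DB,
// and fall back to an offset queried from Kafka for any partition that has
// no stored offset yet. The partition type is left generic so the logic is
// independent of the Kafka client classes.
def resolveStartingOffsets[TP](
    allPartitions: Set[TP],
    storedOffsets: Map[TP, Long],
    fallbackOffset: TP => Long): Map[TP, Long] =
  allPartitions.map { tp =>
    tp -> storedOffsets.getOrElse(tp, fallbackOffset(tp))
  }.toMap
```

With spark-streaming-kafka the keys would be kafka.common.TopicAndPartition, and the fallback would query the leaders through KafkaCluster (which is private[spark] in some Spark versions, so you may need a local copy of the class).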

Yes, if you don't include a starting offset for a particular partition, it
will be ignored.
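For the if/else you describe, a rough sketch against the Spark 1.x spark-streaming-kafka API (`makeStream` is a made-up wrapper, and `storedOffsets` stands in for whatever your DB lookup returns):

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch only: choose between the two createDirectStream overloads based on
// whether offsets were found in the DB. Note the fromOffsets overload reads
// ONLY the partitions present in the map -- any partition you leave out is
// ignored, so fill in missing partitions (e.g. from KafkaCluster) first.
def makeStream(
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Set[String],
    storedOffsets: Map[TopicAndPartition, Long]
): InputDStream[(String, String)] = {
  if (storedOffsets.isEmpty) {
    // First run: no stored offsets, let auto.offset.reset pick the start.
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
  } else {
    // Subsequent runs: start exactly where the DB says you left off.
    val messageHandler =
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
      (String, String)](ssc, kafkaParams, storedOffsets, messageHandler)
  }
}
```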

On Thu, Dec 3, 2015 at 3:31 PM, Dan Dutrow <dan.dut...@gmail.com> wrote:

> Hey Cody, I'm convinced that I'm not going to get the functionality I want
> without using the Direct Stream API.
>
> I'm now looking through
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md#exactly-once-using-transactional-writes
> where you say "For the very first time the job is run, the table can be
> pre-loaded with appropriate starting offsets."
>
> Could you provide some guidance on how to determine valid starting offsets
> the very first time, particularly in my case where I have 10+ topics in
> multiple different deployment environments with an unknown and potentially
> dynamic number of partitions per topic per environment?
>
> I'd be happy if I could initialize all consumers with *auto.offset.reset
> = "largest"*, record the partitions and offsets as they flow through
> Spark, and then use those discovered offsets from there on out.
>
> I'm thinking I can probably just do some if/else logic and use the basic
> createDirectStream, and the more advanced
> createDirectStream(...fromOffsets...) if the offsets for my topic name
> exist in the database. Any reason that wouldn't work? If I don't include
> an offset range for a particular partition, will that partition be ignored?
>
>
>
>
> On Wed, Dec 2, 2015 at 3:17 PM Cody Koeninger <c...@koeninger.org> wrote:
>
>> Use the direct stream.  You can put multiple topics in a single stream,
>> and differentiate them on a per-partition basis using the offset range.
>>
>> On Wed, Dec 2, 2015 at 2:13 PM, dutrow <dan.dut...@gmail.com> wrote:
>>
>>> I found the JIRA ticket:
>>> https://issues.apache.org/jira/browse/SPARK-2388
>>>
>>> It was marked as invalid.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25550.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> --
> Dan ✆
>
