Thanks. Looking at the KafkaCluster.scala code
(https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaCluster.scala#L253),
it seems a little hacky to alter and recompile Spark just to expose those
methods, so I'll use the receiver API for the time being and watch for
changes as this API evolves and those methods become more accessible.
Meanwhile, I'll look into incorporating a database, perhaps Tachyon, to
persist offset and state data across redeployments.
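
To make that concrete, the redeploy-safe startup I have in mind with the
direct API would look roughly like the sketch below: load the last stored
offsets and hand them to createDirectStream. The OffsetStore object is a
hypothetical stand-in for whatever ends up backing it (Tachyon, ZK, a
database); the Spark and Kafka calls are the standard 1.x direct-API ones.

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical offset store; swap in Tachyon, ZK, or a database as appropriate.
object OffsetStore {
  // Last committed offset per partition; a real job would fall back to
  // earliest/latest offsets when nothing has been stored yet.
  def load(group: String, topic: String): Map[TopicAndPartition, Long] =
    Map(TopicAndPartition(topic, 0) -> 0L) // TODO: real lookup
}

object RestartFromStoredOffsets {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct-restart"), Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    val fromOffsets = OffsetStore.load("my-group", "my-topic")

    // The fromOffsets overload of createDirectStream also takes a messageHandler,
    // since this variant has no default mapping to (key, value).
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
      (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      // ... process rdd, then persist the batch's ending offsets ...
    }

    ssc.start()
    ssc.awaitTermination()
  }
}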

On Fri, Aug 14, 2015 at 3:21 PM Cody Koeninger <c...@koeninger.org> wrote:

> I don't entirely agree with that assessment. Not paying for extra cores
> to run receivers was about as important a motivation for the API as
> delivery semantics.
>
> As I said in the JIRA tickets on the topic, if you want to use the direct
> API and save offsets to ZK, you can. The right way to make that easier is
> to expose the (currently private) methods that already exist in
> KafkaCluster.scala for committing offsets through Kafka's API. I don't
> think adding another "do the wrong thing" option is beneficial.
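
For reference, saving each batch's offsets yourself already needs only the
public pieces of the direct API: HasOffsetRanges exposes where every batch
starts and ends, and where you write that is up to the application. A rough
sketch (the OffsetCommitter object, and its println standing in for a real
ZK or database write, are hypothetical):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

object OffsetCommitter {
  // Hypothetical: persist one batch's ending offsets under a group id, e.g. to
  // /consumers/<group>/offsets/<topic>/<partition> in ZK, or to a database row.
  def save(group: String, ranges: Array[OffsetRange]): Unit =
    ranges.foreach { r =>
      println(s"$group ${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
    }

  // Capture the offsets before any transformation that discards the underlying
  // KafkaRDD, do the batch's work, then record where the batch ended.
  def attach[T](group: String, stream: DStream[T])(process: RDD[T] => Unit): Unit =
    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      process(rdd)
      save(group, ranges)
    }
}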
>
> On Fri, Aug 14, 2015 at 11:34 AM, dutrow <dan.dut...@gmail.com> wrote:
>
>> In summary, it appears that the Direct API was intended specifically to
>> enable exactly-once semantics. This can be achieved with idempotent
>> transformations, or with transactional processing in which the database
>> guarantees a one-to-one mapping between inputs and results. For the
>> latter, you need to store your offsets in the database of record.
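
To make the "offsets in the database of record" part concrete: the usual
pattern is to write the results and the batch's ending offsets in one
transaction, so a batch replayed after a failure just repeats the same
writes. The table and column names below are made up and the MySQL-style
REPLACE is only for brevity; the JDBC calls themselves are standard.

import java.sql.DriverManager
import org.apache.spark.streaming.kafka.OffsetRange

object TransactionalSink {
  // Hypothetical schema:
  //   results(batch_key PRIMARY KEY, value)
  //   offsets(group_id, topic, kafka_partition, until_offset,
  //           PRIMARY KEY (group_id, topic, kafka_partition))
  def saveBatchTransactionally(jdbcUrl: String,
                               group: String,
                               results: Seq[(String, Long)],
                               ranges: Array[OffsetRange]): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      conn.setAutoCommit(false)

      val upsertResult = conn.prepareStatement(
        "REPLACE INTO results (batch_key, value) VALUES (?, ?)") // idempotent per key
      results.foreach { case (k, v) =>
        upsertResult.setString(1, k); upsertResult.setLong(2, v); upsertResult.executeUpdate()
      }

      val upsertOffset = conn.prepareStatement(
        "REPLACE INTO offsets (group_id, topic, kafka_partition, until_offset) VALUES (?, ?, ?, ?)")
      ranges.foreach { r =>
        upsertOffset.setString(1, group); upsertOffset.setString(2, r.topic)
        upsertOffset.setInt(3, r.partition); upsertOffset.setLong(4, r.untilOffset)
        upsertOffset.executeUpdate()
      }

      // Results and offsets commit or roll back together.
      conn.commit()
    } catch {
      case e: Exception => conn.rollback(); throw e
    } finally {
      conn.close()
    }
  }
}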
>>
>> If you as a developer do not necessarily need exactly-once semantics, then
>> you can probably get by fine using the receiver API.
>>
>> The hope is that one day the Direct API could be augmented with
>> Spark-abstracted offset storage (in ZooKeeper, Kafka, or something else
>> outside of the Spark checkpoint), since this would let developers easily
>> take advantage of the Direct API's performance benefits and simpler
>> parallelism model. I think it would be worth adding, even if it came
>> with some "buyer beware" caveats.
>>
>>
