Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
If you're looking for instrumentation finer-grained than batch boundaries,
you'd have to do something with the individual messages yourself. You have
full access to the individual messages, including their offsets.
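
For instance, a minimal sketch of that kind of per-record instrumentation
(broker address, topic name and group id below are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("offset-instrumentation")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",            // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "offset-instrumentation"              // placeholder group id
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)

// Every ConsumerRecord carries its own topic, partition, offset and
// (for Kafka 0.10+) timestamp, independent of batch boundaries.
stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val info = (record.topic, record.partition, record.offset, record.timestamp)
    println(info) // stand-in for a real metrics sink
  }
}

ssc.start()
ssc.awaitTermination()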



Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
Of course I am not asking to commit every message. Instead, I am seeking to
commit the last consumed offset at a given interval. For example, from the
1st until the 5th second of execution, messages up to offset 100,000 of
partition 10 were consumed; then from the 6th until the 10th second, the last
consumed offset of the same partition was 200,000; and so forth. This is the
information I seek to get.
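
Batch boundaries are the natural fit for that kind of interval: with a 5
second batch interval, committing each batch's offset ranges records exactly
those per-interval high-water marks in __consumer_offsets. A sketch, assuming
stream is the DStream returned by KafkaUtils.createDirectStream:

import org.apache.spark.streaming.kafka010._

stream.foreachRDD { rdd =>
  // One OffsetRange per topic-partition in this batch; untilOffset is one
  // past the last consumed offset (e.g. 100,000, then 200,000, and so on).
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  offsetRanges.foreach { or =>
    println(s"${or.topic}-${or.partition}: consumed up to ${or.untilOffset}")
  }

  // One commit per batch interval, not one per message.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}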




Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
Are you asking for commits for every message?  Because that will kill
performance.




Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
Indeed I have. But even when storing the offsets in Spark and committing them
upon completion of an output operation within the foreachRDD call (as shown
in the example), the only offset that Spark’s Kafka implementation commits to
Kafka is the offset of the last message. For example, if I have 100 million
messages, Spark will commit only the 100 millionth offset, not the offsets of
the intermediate batches - hence the question.
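
One way to inspect what actually lands in __consumer_offsets after each batch
is to poll the committed offset with a plain KafkaConsumer. A diagnostic
sketch; broker address, group id, topic and partition are placeholders:

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
props.put("group.id", "offset-instrumentation")   // the streaming app's group id
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("events", 10)

// committed() returns the group's last committed OffsetAndMetadata for the
// partition, or null if nothing has been committed yet.
println(s"committed for $tp: ${consumer.committed(tp)}")
consumer.close()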




Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
have you read

http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself




Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Dominik Safaric
The reason why I want to obtain this information, i.e. <topic, partition,
offset, timestamp> tuples, is to relate the consumption rate to the
production rate using the __consumer_offsets Kafka internal topic.
Interestingly, Spark’s KafkaConsumer implementation does not auto-commit the
offsets upon offset commit expiration because, as seen in the logs, Spark
overrides the enable.auto.commit property to false.

Any idea how to use the KafkaConsumer’s auto offset commits? Keep in mind
that I do not care about exactly-once semantics, so having messages replayed
is perfectly fine.
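
For reference, the consumer configuration from the integration guide looks
roughly like the following (broker and group id are placeholders). Whatever
is set here, the direct stream overrides enable.auto.commit to false on its
internal consumers, as the logs show, so periodic commits have to be issued
explicitly, e.g. via commitAsync:

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",            // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "offset-instrumentation",             // placeholder group id
  "auto.offset.reset" -> "latest",
  // Overridden to false by the direct stream regardless of what is set here.
  "enable.auto.commit" -> (false: java.lang.Boolean)
)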




Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
What is it you're actually trying to accomplish?

You can get topic, partition, and offset bounds from an offset range like

http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets

Timestamp isn't really a meaningful idea for a range of offsets.
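
The pattern behind that link, roughly (per the guide, the cast has to be the
first thing done to the stream's RDD; stream is the direct stream):

import org.apache.spark.streaming.kafka010._

stream.foreachRDD { rdd =>
  // RDDs produced by the direct stream implement HasOffsetRanges.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { or =>
    println(s"topic=${or.topic} partition=${or.partition} " +
            s"from=${or.fromOffset} until=${or.untilOffset}")
  }
}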





Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-25 Thread Dominik Safaric
Hi all,

Because the Spark Streaming direct Kafka consumer maps offsets for a given
Kafka topic and partition internally while having enable.auto.commit set to
false, how can I retrieve the offset of each of the consumer’s poll calls
using the offset ranges of an RDD? More precisely, the information I seek to
get after each poll call is the following: <topic, partition, offset,
timestamp>.

Thanks in advance,
Dominik
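
A sketch of both views of that tuple, assuming stream is the DStream returned
by KafkaUtils.createDirectStream: the offset ranges give the per-batch topic,
partition and offset bounds, while a per-record timestamp has to be read from
the ConsumerRecords themselves:

import org.apache.spark.streaming.kafka010._

stream.foreachRDD { rdd =>
  // Batch-level view: offset bounds per topic-partition.
  rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { or =>
    println(s"${or.topic}/${or.partition}: ${or.fromOffset}..${or.untilOffset}")
  }

  // Record-level view: the <topic, partition, offset, timestamp> tuples.
  rdd.map(r => (r.topic, r.partition, r.offset, r.timestamp)).foreach(println)
}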