[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646419#comment-14646419
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

Thanks Cody and Sean. I'll take a look at the process of submitting changes.

 Need how-to for resuming direct Kafka streaming consumers where they had left 
 off before getting terminated, OR actual support for that mode in the 
 Streaming API
 -

 Key: SPARK-9434
 URL: https://issues.apache.org/jira/browse/SPARK-9434
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Examples, Streaming
Affects Versions: 1.4.1
Reporter: Dmitry Goldenberg

 We've been getting some mixed information regarding how to cause our direct 
 streaming consumers to resume processing from where they left off in terms of 
 the Kafka offsets.
 On the one hand, we're hearing: "If you are restarting the streaming app 
 with Direct kafka from the checkpoint information (that is, restarting), then 
 the last read offsets are automatically recovered, and the data will start 
 processing from that offset. All the N records added in T will stay buffered 
 in Kafka." (where T is the interval of time during which the consumer was 
 down).
 On the other hand, there are tickets such as SPARK-6249 and SPARK-8833 which 
 are marked as "won't fix" but which seem to ask for the functionality we need, 
 with comments like "I don't want to add more config options with confusing 
 semantics around what is being used for the system of record for offsets, I'd 
 rather make it easy for people to explicitly do what they need."
 The use-case is actually very clear and doesn't ask for confusing semantics. 
 An API option to resume reading where you left off, in addition to the 
 "smallest" or "largest" auto.offset.reset, would be *very* useful, probably for 
 quite a few folks.
 We're asking for this as an enhancement request. SPARK-8833 states: "I am 
 waiting for getting enough usecase to float in before I take a final call." 
 We're adding to that.
 In the meantime, can you clarify the confusion? Does direct streaming 
 persist the progress information into DStream checkpoints or does it not? 
 If it does, why is it that we're not seeing that happen? Our consumers start 
 with auto.offset.reset=largest, and that causes them to read from the first 
 offset of data that is written to Kafka *after* the consumer has been 
 restarted, meaning we're missing data that had come in while the consumer was 
 down.
 If the progress is stored in DStream checkpoints, we want to know a) how to 
 cause that to work for us and b) where the said checkpointing data is stored 
 physically.
 Conversely, if this is not accurate, then is our only choice to manually 
 persist the offsets into ZooKeeper? If that is the case, then a) we'd like a 
 clear, more complete code sample to be published, since the one in the Kafka 
 streaming guide is incomplete (it lacks the actual lines of code persisting 
 the offsets), and b) we'd like to request that SPARK-8833 be revisited as a 
 feature worth implementing in the API.
 Thanks.





[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645923#comment-14645923
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

Sean, I specifically did not want to add my comments there because that ticket 
is marked as "Won't fix". And now you seem to be resolving this as a DUP, which 
IMHO is too quick a resolution without a discussion.

"Do you want to manage Kafka streaming using Kafka offsets?"

The answer is, it depends. I first need to know if that is what I will have to 
do. Please re-read what I stated:

On the one hand, we're hearing: "If you are restarting the streaming app 
with Direct kafka from the checkpoint information (that is, restarting), then 
the last read offsets are automatically recovered, and the data will start 
processing from that offset. All the N records added in T will stay buffered in 
Kafka." (where T is the interval of time during which the consumer was down).

If Direct streaming already does the progress persistence for us, I'm all for 
it and problem solved. If that is the case, I will need to know how to enable 
this behavior, because I am not seeing it in my testing.

However, if, to achieve the effect of resuming from where the consumer left 
off, I need to manually manage the offsets, then yes, I want to file an 
enhancement request which would make this option explicit in the API rather 
than us having to implement it. In the meantime, a fuller sample for 
persisting and retrieving offsets with OffsetCommitRequest and OffsetFetchRequest 
would also be helpful. Right now, the existing sample in the Kafka streaming 
doc doesn't include a fetch example, and the offset-update example doesn't 
fully demonstrate the update logic.






[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645952#comment-14645952
 ] 

Sean Owen commented on SPARK-9434:
--

[~dgoldenberg] Discussion is good; I'm asking you to put it in the right place. 
The mailing list is a good place; the existing discussion is a good place. 
Existing JIRAs can be reopened -- if appropriate. I'd suggest either one of 
those, with a preference for the mailing list, since you have a question at 
this point rather than a specific change to propose. 

Right now, it sounds like you're looking for Approach 2 in 
https://spark.apache.org/docs/latest/streaming-kafka-integration.html. This 
describes how direct receivers store and recover offsets automatically, which 
is what you appear to be asking for. This mechanism does not update ZK. 
However, farther down, it describes how you can track these offsets within 
Spark Streaming yourself without interacting with ZK directly. Look for 
HasOffsetRanges.
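
To make that concrete, a minimal sketch of reading offsets via HasOffsetRanges 
in the Java API might look like this (the stream is assumed to come from 
KafkaUtils.createDirectStream; the variable names are illustrative):

{code}
import java.util.concurrent.atomic.AtomicReference;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

// Assumes a direct stream built elsewhere:
// JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(...);
final AtomicReference<OffsetRange[]> offsetRanges = new AtomicReference<OffsetRange[]>();

directKafkaStream.foreachRDD(
  new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> rdd) {
      // The underlying RDD of each batch of a direct stream implements HasOffsetRanges.
      OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
      offsetRanges.set(ranges);  // stash on the driver for logging or persistence
      return null;
    }
  });
{code}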

I am going to re-close this soon, unless there is something here that is not 
just continuing the discussion that already took place in SPARK-6249, and is a 
proposal for a doc/code change rather than a query about how to use the API. 




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645971#comment-14645971
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

[~tdas]: Could you weigh in on this discussion? In our previous exchanges, we 
talked about: "If you are restarting the streaming app with Direct kafka from 
the checkpoint information (that is, restarting), then the last read offsets 
are automatically recovered, and the data will start processing from that 
offset. All the N records added in T will stay buffered in Kafka." (where T is 
the interval of time during which the consumer was down.)

Is it indeed the case that the *last read offsets are automatically 
recovered*? I am not seeing that. Our consumers resume from the first data 
that's added to Kafka *after* the consumers have been restarted, leading to the 
loss of data added *while the consumers were down*.




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646043#comment-14646043
 ] 

Cody Koeninger commented on SPARK-9434:
---

Dmitry, the nabble link you posted indicates your message was never accepted by 
the mailing list.

Looking through my email archives, there are only 3 discussions related to 
Kafka that you successfully sent to the mailing list, and all of them got 
answers. As Sean said, I think that's a better forum for questions.

To try and address what I think your question is... no, if you don't turn on 
checkpointing, you won't have any offsets saved in a checkpoint. Checkpointing 
isn't necessary for every job, so it is not turned on by default. The docs for 
KafkaUtils already explicitly say "To recover from driver failures, you have to 
enable checkpointing", with a link to the guide on how to do that.

The closed tickets you're linking to relate to a desire for some kind of 
automated committing to, and/or recovery from, ZooKeeper-based offsets. As I 
stated in those tickets, I think the correct way to make this easier for people 
is just to expose an easier API for interacting with ZK. That already exists 
in the code, it's just private.




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645992#comment-14645992
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

@Sean Owen: Per your point about placing this discussion on the mailing list, 
I [had placed my question 
there|http://apache-spark-user-list.1001560.n3.nabble.com/What-happens-when-a-streaming-consumer-job-is-killed-then-restarted-tt23348.html]
 and it has remained unanswered since June 16th.




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646011#comment-14646011
 ] 

Sean Owen commented on SPARK-9434:
--

Yes, I mean direct stream. While I don't think you should open a JIRA to get 
attention for your question, I do think it's clearer what this is about now, and 
it could be a legitimate issue, so let me see if I can move it along one tick:

Per https://spark.apache.org/docs/latest/streaming-kafka-integration.html I 
think the answer is no, you do not manually manage offsets in order to achieve 
the exactly-once semantics advertised by the Kafka direct stream. The offsets 
are recorded with checkpoints. However, from glancing at the code, that also 
implies to me that you have to enable checkpointing to get this to work. That 
is not shown in the example; I am not sure whether my read is wrong, whether 
it's implied, or whether it's just omitted from the example.

If that's correct, then I think that's the solution you want (or else, manage 
offsets manually in ZK -- I do that and it works fine), but the example needs to 
be updated to show checkpointing.

Otherwise, the question is why you're not getting the expected behavior 
(without checkpoints), and a simple reproduction would help.

CC [~c...@koeninger.org]
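
A minimal sketch of what enabling checkpointing for a direct stream might look 
like (assuming the Spark 1.4 Java API; the checkpoint path and app name are 
illustrative, not from the guide):

{code}
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

// Hypothetical checkpoint location; any fault-tolerant filesystem (e.g. HDFS) works.
final String checkpointDir = "hdfs:///checkpoints/my-kafka-consumer";

JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
  @Override
  public JavaStreamingContext create() {
    SparkConf conf = new SparkConf().setAppName("my-direct-consumer");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
    jssc.checkpoint(checkpointDir);  // enables checkpointing, offsets included
    // Build the direct stream and the rest of the processing graph *inside*
    // this factory, e.g. KafkaUtils.createDirectStream(jssc, ...), so that it
    // can be reconstructed from the checkpoint on restart.
    return jssc;
  }
};

// First run: creates the context via the factory. Subsequent restarts: recover
// the context -- and the last read offsets -- from the checkpoint directory.
JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, factory);
jssc.start();
jssc.awaitTermination();
{code}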




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646021#comment-14646021
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

Great, thanks, Sean. Perhaps Cody and/or Tathagata can weigh in with the 
details of how checkpointing would be enabled for direct streaming.  I agree 
that if that is the case, this would deserve a clarification in the Kafka 
integration guide, preferably with a code sample.

A fuller example of a manual ZK offset update would be great, whether we have to 
go the "else" route you mentioned or not. We may eventually need to do that 
anyway in order to enable monitoring of the direct stream, since the monitoring 
tools rely on ZK offsets.

For now, though, all I'm trying to do is get the checkpoint-based progress 
persistence to work, so a clear example would be great.

By the way, how does the checkpointing persistence work? Where on a worker 
node would I be able to find this checkpointing datastore, and is it easily 
browsable? Thanks.




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646033#comment-14646033
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

Ah. I think this is making sense then :) 
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646045#comment-14646045
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

Perhaps the Kafka integration guide could reference this part of the doc? It'd 
make it a lot easier to make the connection...




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646047#comment-14646047
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

Yup, Cody, just connected the dots with the checkpointing doc. The Kafka and 
checkpointing docs seem disjoint, so it's not trivial to make the connection 
from the get-go. Thanks.




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646113#comment-14646113
 ] 

Cody Koeninger commented on SPARK-9434:
---

If you want to submit a PR with doc changes to add another link between the 
two, go for it.




[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645965#comment-14645965
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

[~sowen]: Putting this here was the only way to start getting true attention 
and more definitive answers.

We *are* using Approach 2, the direct streaming. As a slight side note, there 
are no receivers there, so the term "direct receivers" that you seem to be using 
doesn't apply.

That said, the doc states: "Offsets are tracked by Spark Streaming within its 
checkpoints." Can you confirm that this storage of offsets is *only* done for 
the _smallest_ and _largest_ values of auto.offset.reset? This really 
appears to be the case. And if that is the case, then I'll need to use manual 
updates of offsets so that we can ascertain where to resume processing.

If that is the case, I'm filing an enhancement request for making this an 
option in the API.

From the sample perspective, at least the "..." in the code below needs to be 
filled in:
{code}
foreachRDD(
  new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> rdd) throws IOException {
      for (OffsetRange o : offsetRanges.get()) {
        System.out.println(
          o.topic() + " " + o.partition() + " " + o.fromOffset() + " " + o.untilOffset()
        );
      }
      ...
      return null;
    }
  }
);
{code}
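
One way the elided "..." could be filled in (purely as a sketch, not the 
official sample) is to persist each range's ending offset to ZooKeeper under 
the conventional consumer-offsets path. The helper below is hypothetical and 
assumes the I0Itec ZkClient bundled with Kafka 0.8, constructed with a string 
serializer (as Kafka's own tooling uses) so that offsets are stored as plain 
strings:

{code}
import org.I0Itec.zkclient.ZkClient;
import org.apache.spark.streaming.kafka.OffsetRange;

// Hypothetical helper: store each partition's untilOffset in ZooKeeper at
// /consumers/<group>/offsets/<topic>/<partition>, the path monitoring tools
// typically read. Assumes zkClient was created with a string serializer.
public static void persistOffsets(ZkClient zkClient, String group, OffsetRange[] ranges) {
  for (OffsetRange o : ranges) {
    String path = "/consumers/" + group + "/offsets/" + o.topic() + "/" + o.partition();
    if (!zkClient.exists(path)) {
      zkClient.createPersistent(path, true);  // create parent znodes as needed
    }
    zkClient.writeData(path, String.valueOf(o.untilOffset()));
  }
}
{code}

On restart, the consumer could read these paths back and pass the result as the 
fromOffsets argument of the corresponding KafkaUtils.createDirectStream overload.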
