[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

Dmitry Goldenberg (JIRA) Wed, 29 Jul 2015 05:27:37 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645965#comment-14645965
 ]


Dmitry Goldenberg commented on SPARK-9434:
------------------------------------------

[~sowen]: Putting this here was the only way to start getting true attention 
and more definitive answers.

We *are* using Approach 2, the direct streaming. As a slight side note, there 
are no receivers there so the term "direct receivers" that you seem to be using 
doesn't apply.

That said, the doc states: "Offsets are tracked by Spark Streaming within its 
checkpoints."  Can you confirm that this storage of offsets is *only* done for 
the _smallest_ and _greatest_ values of auto.reset.offset ?  This really 
appears to be the case.  And if that is the case, then I'll need to use manual 
update of offsets so that we can manually ascertain where to resume processing.

If that is the case, I'm filing for an enhancement request for this being an 
option in the API.

>From the sample perspective, at least the "..." in the below needs to be 
>filled in:
{code}
foreachRDD(
   new Function<JavaPairRDD<String, String>, Void>() {
     @Override
     public Void call(JavaPairRDD<String, String> rdd) throws IOException {
       for (OffsetRange o : offsetRanges.get()) {
         System.out.println(
           o.topic() + " " + o.partition() + " " + o.fromOffset() + " " + 
o.untilOffset()
         );
       }
       ...
       return null;
     }
   }
 );
{code}

> Need how-to for resuming direct Kafka streaming consumers where they had left 
> off before getting terminated, OR actual support for that mode in the 
> Streaming API
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-9434
>                 URL: https://issues.apache.org/jira/browse/SPARK-9434
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, Examples, Streaming
>    Affects Versions: 1.4.1
>            Reporter: Dmitry Goldenberg
>
> We've been getting some mixed information regarding how to cause our direct 
> streaming consumers to resume processing from where they left off in terms of 
> the Kafka offsets.
> On the one hand side, we're hearing "If you are restarting the streaming app 
> with Direct kafka from the checkpoint information (that is, restarting), then 
> the last read offsets are automatically recovered, and the data will start 
> processing from that offset. All the N records added in T will stay buffered 
> in Kafka." (where T is the interval of time during which the consumer was 
> down).
> On the other hand, there are tickets such as SPARK-6249 and SPARK-8833 which 
> are marked as "won't fix" which seem to ask for the functionality we need, 
> with comments like "I don't want to add more config options with confusing 
> semantics around what is being used for the system of record for offsets, I'd 
> rather make it easy for people to explicitly do what they need."
> The use-case is actually very clear and doesn't ask for confusing semantics. 
> An API option to resume reading where you left off, in addition to the 
> smallest or greatest auto.offset.reset should be *very* useful, probably for 
> quite a few folks.
> We're asking for this as an enhancement request. SPARK-8833 states " I am 
> waiting for getting enough usecase to float in before I take a final call." 
> We're adding to that.
> In the meantime, can you clarify the confusion?  Does direct streaming 
> persist the progress information into "DStream checkpoints" or does it not?  
> If it does, why is it that we're not seeing that happen? Our consumers start 
> with auto.offset.reset=greatest and that causes them to read from the first 
> offset of data that is written to Kafka *after* the consumer has been 
> restarted, meaning we're missing data that had come in while the consumer was 
> down.
> If the progress is stored in "DStream checkpoints", we want to know a) how to 
> cause that to work for us and b) where the said checkpointing data is stored 
> physically.
> Conversely, if this is not accurate, then is our only choice to manually 
> persist the offsets into Zookeeper? If that is the case then a) we'd like a 
> clear, more complete code sample to be published, since the one in the Kafka 
> streaming guide is incomplete (it lacks the actual lines of code persisting 
> the offsets) and b) we'd like to request that SPARK-8833 be revisited as a 
> feature worth implementing in the API.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

Reply via email to