Github user koeninger commented on the issue:

    https://github.com/apache/spark/pull/15102
  
    
    > It would be nice to be able to do something other than earliest/latest.
    
    That's what Assign and the starting offset arguments to the Subscribe 
    strategies are for.  The implementation was already there.
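
    For reference, a minimal sketch of what I mean, using the existing DStream 
    Kafka 0.10 API (the broker address, topic name, group id, and offsets here 
    are made up for illustration):

        import org.apache.kafka.common.TopicPartition
        import org.apache.kafka.common.serialization.StringDeserializer
        import org.apache.spark.streaming.kafka010._

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "localhost:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "example-group"
        )

        // Start each partition at an explicit offset, not just earliest/latest.
        val startingOffsets = Map(
          new TopicPartition("events", 0) -> 1234L,
          new TopicPartition("events", 1) -> 5678L
        )

        // Assign pins the consumer to exactly these partitions and offsets;
        // the Subscribe strategy has a corresponding overload that also takes
        // starting offsets.
        val strategy = ConsumerStrategies.Assign[String, String](
          startingOffsets.keys.toList,
          kafkaParams,
          startingOffsets
        )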
    
    > When specifying earliest, you end up with really big partitions. 
    
    Again, spark.streaming.kafka.maxRatePerPartition and the associated 
    implementation were already there.  If you don't want the rate cap coupled 
    to time, decoupling it is pretty straightforward.  The bigger question is 
    when / if / how you're going to do backpressure.
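
    Concretely, the existing knobs look like this (the rate value is arbitrary; 
    in the DStream the per-batch cap works out to roughly maxRatePerPartition 
    * batch duration * number of partitions):

        import org.apache.spark.SparkConf

        val conf = new SparkConf()
          .setAppName("kafka-rate-limited")
          // Cap records fetched per partition per second, so starting from
          // earliest doesn't produce one enormous first batch.
          .set("spark.streaming.kafka.maxRatePerPartition", "10000")
          // Existing DStream backpressure, which adjusts the rate dynamically
          // based on scheduling delay and processing time.
          .set("spark.streaming.backpressure.enabled", "true")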
    
    > One question, is it a problem if two tasks are pulling from the same 
    > topic partition in parallel? Does this break the assumptions of our 
    > caching?
    
    This breaks fundamental assumptions of Kafka (per-topicpartition ordering) 
    and really shouldn't be done.  It would also invalidate the consumer cache, 
    which assumes a single consumer per topicpartition per executor.
    
    > I can do a final pass over the code, but do we think we are getting close 
    > to something that we can merge and iterate on?
    
    I think we're in much better shape than when we started, but I still 
    honestly think this implementation made a bunch of user-visible behavioral 
    and configuration changes from the DStream that really have nothing to do 
    with the inherent differences between it and structured streaming.  This 
    isn't just me whining about "you changed my code"; it really is going to 
    make it harder to explain to people and harder to maintain.

