Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/15102
  
    I spent a while playing around with this today on a real cluster, and 
overall it is pretty cool!  I have a few suggestions we should implement in the 
long run, but these can probably be done in follow-up PRs:
     - It would be nice to be able to do something other than earliest/latest.  
In particular, I have some low-volume streams where I want to go back a bit as 
I'm working interactively.  Going back all the way results in a really long 
job.  If there were a time index, I'd love to say `-10 minutes`, but even just 
`-100 offsets` would be great (see the first sketch after this list).
     - When specifying `earliest`, you end up with really big partitions.  As a 
result you get virtually no progress reporting until it's done.  I'd consider 
having a max number of offsets in a task; it should be easy to split them up 
in `getBatch` (see the second sketch below).  One question: is it a problem if 
two tasks are pulling from the same topic partition in parallel?  Does this 
break the assumptions of our caching?
     - Similarly, you get no results until the end.  We might also want to have 
a `maxOffsetsInBatch` similar to what we have in the `FileStreamSource` (third 
sketch below).
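    
    Roughly what I have in mind for the first point -- the option names and the 
relative/time-based forms here are made up for illustration, not part of this 
PR's API:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("kafka-offset-sketch").getOrCreate()
    
    // Hypothetical: start a fixed number of offsets back per partition (or at a
    // point in time) instead of only earliest/latest.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "topic1")
      .option("startingOffsets", "-100")            // hypothetical: 100 offsets back per partition
      // .option("startingTime", "now - 10 minutes") // hypothetical time-based variant
      .load()
    ```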
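    
    For the oversized-partition point, here is a rough sketch of how `getBatch` 
could cap the offsets handled by a single task by splitting one topic 
partition's range into chunks (`KafkaOffsetRange`, `maxOffsetsPerTask`, and the 
split helper are names I made up for this sketch):
    
    ```scala
    // Illustrative only: split a topic partition's [from, until) offset range into
    // chunks of at most maxOffsetsPerTask, so a single task never reads the whole
    // backlog when starting from earliest.
    case class KafkaOffsetRange(topic: String, partition: Int, from: Long, until: Long)
    
    def splitRange(range: KafkaOffsetRange, maxOffsetsPerTask: Long): Seq[KafkaOffsetRange] = {
      (range.from until range.until by maxOffsetsPerTask).map { start =>
        range.copy(from = start, until = math.min(start + maxOffsetsPerTask, range.until))
      }
    }
    
    // e.g. splitRange(KafkaOffsetRange("topic1", 0, 0L, 2500L), 1000L)
    //   -> ranges [0, 1000), [1000, 2000), [2000, 2500)
    ```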
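    
    And for the rate-limiting point, something analogous to `maxFilesPerTrigger` 
on the file source; `maxOffsetsInBatch` below is just the placeholder name from 
the suggestion above, reusing the `spark` session from the first sketch:
    
    ```scala
    // Placeholder option: cap how many offsets a single batch may consume,
    // so progress and results show up incrementally instead of only at the end.
    val limited = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "topic1")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsInBatch", "10000")   // hypothetical, analogous to maxFilesPerTrigger
      .load()
    ```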
    
    I can do a final pass over the code, but do we think we are getting close 
to something that we can merge and iterate on?

