Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/15102

I spent a while playing around with this today on a real cluster, and overall it is pretty cool! I have a few suggestions we should implement in the long run, but these can probably be done in follow-up PRs:

- It would be nice to be able to do something other than earliest/latest. In particular, I have some low-volume streams where I want to go back a bit as I'm working interactively. Going back all the way results in a really long job. If there were a time index, I'd love to say `-10 minutes`, but even just `-100 offsets` would be great.
- When specifying `earliest`, you end up with really big partitions. As a result, you get virtually no progress reporting until it's done. I'd consider having a max number of offsets in a task; it should be easy to split them up in `getBatch`. One question: is it a problem if two tasks are pulling from the same topic partition in parallel? Does this break the assumptions of our caching?
- Similarly, you get no results until the end. We might also want to have a `maxOffsetsInBatch` similar to what we have in the `FileStreamSource`.

I can do a final pass over the code, but do we think we are getting close to something that we can merge and iterate on?
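The per-task cap suggested above could be sketched roughly as follows. This is a minimal, self-contained sketch, not the actual Spark code: `OffsetRange`, `splitRange`, and `maxOffsetsPerTask` are illustrative names, and the real `getBatch` would work with Kafka `TopicPartition`s rather than a plain string.

```scala
// Illustrative sketch of splitting one topic partition's offset range into
// multiple tasks, each covering at most maxOffsetsPerTask offsets.
// Offsets are half-open: [fromOffset, untilOffset).
case class OffsetRange(topicPartition: String, fromOffset: Long, untilOffset: Long)

def splitRange(range: OffsetRange, maxOffsetsPerTask: Long): Seq[OffsetRange] = {
  // Step through the range, clamping the final slice at untilOffset.
  (range.fromOffset until range.untilOffset by maxOffsetsPerTask).map { start =>
    OffsetRange(range.topicPartition, start,
      math.min(start + maxOffsetsPerTask, range.untilOffset))
  }
}
```

For example, a 250-offset backlog with a cap of 100 would yield three tasks of 100, 100, and 50 offsets, which is what would restore incremental progress reporting. Whether two of those slices may be consumed in parallel safely is exactly the caching question raised above.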