Scala KafkaRDD uses a trait to handle this problem, but it is not so easy and straightforward in Python, where we need to have a specific API to handle this, I'm not sure is there any simple workaround to fix this, maybe we should think carefully about it.
2015-06-12 13:59 GMT+08:00 Amit Ramesh <a...@yelp.com>: > > Thanks, Jerry. That's what I suspected based on the code I looked at. Any > pointers on what is needed to build in this support would be great. This is > critical to the project we are currently working on. > > Thanks! > > > On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao <sai.sai.s...@gmail.com> > wrote: > >> OK, I get it, I think currently Python based Kafka direct API do not >> provide such equivalence like Scala, maybe we should figure out to add this >> into Python API also. >> >> 2015-06-12 13:48 GMT+08:00 Amit Ramesh <a...@yelp.com>: >> >>> >>> Hi Jerry, >>> >>> Take a look at this example: >>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2 >>> >>> The offsets are needed because as RDDs get generated within spark the >>> offsets move further along. With direct Kafka mode the current offsets are >>> no more persisted in Zookeeper but rather within Spark itself. If you want >>> to be able to use zookeeper based monitoring tools to keep track of >>> progress, then this is needed. >>> >>> In my specific case we need to persist Kafka offsets externally so that >>> we can continue from where we left off after a code deployment. In other >>> words, we need exactly-once processing guarantees across code deployments. >>> Spark does not support any state persistence across deployments so this is >>> something we need to handle on our own. >>> >>> Hope that helps. Let me know if not. >>> >>> Thanks! >>> Amit >>> >>> >>> On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao <sai.sai.s...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> What is your meaning of getting the offsets from the RDD, from my >>>> understanding, the offsetRange is a parameter you offered to KafkaRDD, why >>>> do you still want to get the one previous you set into? >>>> >>>> Thanks >>>> Jerry >>>> >>>> 2015-06-12 12:36 GMT+08:00 Amit Ramesh <a...@yelp.com>: >>>> >>>>> >>>>> Congratulations on the release of 1.4! >>>>> >>>>> I have been trying out the direct Kafka support in python but haven't >>>>> been able to figure out how to get the offsets from the RDD. Looks like >>>>> the >>>>> documentation is yet to be updated to include Python examples ( >>>>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html). >>>>> I am specifically looking for the equivalent of >>>>> https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2. >>>>> I tried digging through the python code but could not find anything >>>>> related. Any pointers would be greatly appreciated. >>>>> >>>>> Thanks! >>>>> Amit >>>>> >>>>> >>>> >>> >> >