For an alternative take on a similar idea, see
https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
An advantage of the approach I'm taking is that the lower and upper offsets
of the RDD are known in advance, so it's deterministic.
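For example, each partition of the RDD is pinned to a fixed (topic, partition, fromOffset, untilOffset) range, so recomputing a failed task rereads exactly the same messages. In the spark-streaming-kafka createRDD/OffsetRange API (which, to my knowledge, is what this work later became), a batch read over such ranges looks roughly like the sketch below; the broker address, topic name, and offsets are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}
import kafka.serializer.StringDecoder

val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// one OffsetRange per (topic, partition); the bounds are fixed up front,
// so every (re)computation of a partition reads the same slice of the topic
val offsetRanges = Array(
  OffsetRange("events", partition = 0, fromOffset = 0L, untilOffset = 1000L),
  OffsetRange("events", partition = 1, fromOffset = 0L, untilOffset = 1000L)
)

// returns an RDD[(String, String)] of (key, message) pairs
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)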
thanks! i will take a look at your code. didn't realize there was already
something out there.
good point about upper offsets, i will add that feature to our version as
well if you don't mind.
i was thinking about making it deterministic across task failures
transparently (even if no upper offsets are provided).
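for example (just a sketch to illustrate the idea, not what our library does today; it uses the newer kafka consumer api's endOffsets, and the broker address and topic name are placeholders), the upper offsets could be resolved once on the driver when the rdd is defined and frozen from then on, so a retried task always reads the same bounded slice:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val partitions = consumer.partitionsFor("events").asScala
  .map(info => new TopicPartition(info.topic, info.partition))

// endOffsets returns the next offset to be written per partition; freezing these
// as the upper bound makes every (re)computation of the rdd read the same data
val untilOffsets: Map[TopicPartition, Long] =
  consumer.endOffsets(partitions.asJava).asScala
    .map { case (tp, offset) => tp -> offset.longValue }
    .toMap
consumer.close()

// untilOffsets can then be turned into fixed offset ranges for the batch read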
hello all,
we at tresata wrote a library to provide batch integration between
spark and kafka (distributed write of rdd to kafka, distributed read of rdd
from kafka). our main use cases are (in lambda architecture jargon):
* periodic appends to the immutable master dataset on hdfs from kafka