Now that 1.2 is finalized... who are the go-to people to get some long-standing Kafka related issues resolved?
The existing api is not sufficiently safe nor flexible for our production use. I don't think we're alone in this viewpoint, because I've seen several different patches and libraries to fix the same things we've been running into. Regarding flexibility https://issues.apache.org/jira/browse/SPARK-3146 has been outstanding since August, and IMHO an equivalent of this is absolutely necessary. We wrote a similar patch ourselves, then found that PR and have been running it in production. We wouldn't be able to get our jobs done without it. It also allows users to solve a whole class of problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc). Regarding safety, I understand the motivation behind WriteAheadLog as a general solution for streaming unreliable sources, but Kafka already is a reliable source. I think there's a need for an api that treats it as such. Even aside from the performance issues of duplicating the write-ahead log in kafka into another write-ahead log in hdfs, I need exactly-once semantics in the face of failure (I've had failures that prevented reloading a spark streaming checkpoint, for instance). I've got an implementation i've been using https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka /src/main/scala/org/apache/spark/rdd/kafka Tresata has something similar at https://github.com/tresata/spark-kafka, and I know there were earlier attempts based on Storm code. Trying to distribute these kinds of fixes as libraries rather than patches to Spark is problematic, because large portions of the implementation are private[spark]. I'd like to help, but i need to know whose attention to get.
