Now that 1.2 is finalized... who are the go-to people for getting some
long-standing Kafka-related issues resolved?

The existing API is neither sufficiently safe nor flexible for our
production use.  I don't think we're alone in this view, because I've seen
several different patches and libraries aimed at fixing the same problems
we've been running into.

Regarding flexibility

https://issues.apache.org/jira/browse/SPARK-3146

has been outstanding since August, and IMHO an equivalent of this is
absolutely necessary.  We wrote a similar patch ourselves, then found that
PR and have been running it in production.  We wouldn't be able to get our
jobs done without it.  It also allows users to solve a whole class of
problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc).
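To make the flexibility point concrete: what SPARK-3146 asks for is letting a user-supplied handler see the full per-message metadata (topic, partition, offset), not just the decoded key/value. A minimal sketch of that shape follows; the names here are illustrative stand-ins, not Spark's actual API:

```scala
// Illustrative sketch only: these names are hypothetical, not Spark's API.
// The point is that the handler sees topic/partition/offset, not just payload.
case class MessageAndMetadata(
  topic: String,
  partition: Int,
  offset: Long,
  key: Array[Byte],
  message: Array[Byte]
)

// With metadata in hand, users can do their own bookkeeping, e.g. tag each
// record with exactly where it came from in the Kafka log:
def handler(mmd: MessageAndMetadata): (String, Long, String) =
  (s"${mmd.topic}-${mmd.partition}", mmd.offset,
    new String(mmd.message, "UTF-8"))

val sample = MessageAndMetadata("events", 0, 42L,
  Array.emptyByteArray, "hello".getBytes("UTF-8"))
val tagged = handler(sample)
```

Once the handler has the offset, things like SPARK-2388 or arbitrary delay of messages become user-level problems rather than framework changes.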

Regarding safety, I understand the motivation behind WriteAheadLog as a
general solution for streaming from unreliable sources, but Kafka already
is a reliable source.  I think there's a need for an API that treats it as
such.  Even aside from the performance cost of duplicating the write-ahead
log in Kafka into another write-ahead log in HDFS, I need exactly-once
semantics in the face of failure (I've had failures that prevented
reloading a Spark Streaming checkpoint, for instance).
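To spell out the safety point: if Kafka is treated as a reliable, replayable source, exactly-once output falls out of committing each batch's offset range atomically with its results, so that a replayed batch after failure is detected and skipped. Here is a toy sketch of that idea; an in-memory store stands in for a transactional database, and all names are illustrative, not an actual Spark API:

```scala
// Toy sketch of exactly-once output: a batch's offset range is committed
// atomically with its results. A mutable in-memory "store" stands in for a
// real transactional database; these names are illustrative only.
import scala.collection.mutable

case class OffsetRange(topic: String, partition: Int, from: Long, until: Long)

object Store {
  val committed = mutable.Set.empty[OffsetRange]
  val results = mutable.ListBuffer.empty[String]

  // "Transaction": write the results and the offset range together, or not
  // at all. An already-committed range means a replay, so skip it.
  def commit(range: OffsetRange, batch: Seq[String]): Boolean = {
    if (committed.contains(range)) false
    else {
      results ++= batch
      committed += range
      true
    }
  }
}

val range = OffsetRange("events", 0, 0L, 3L)
val batch = Seq("a", "b", "c")

val first = Store.commit(range, batch)  // normal processing commits
val replay = Store.commit(range, batch) // recovery replays the same range
```

Because the offsets live with the output rather than in a separate checkpoint, recovery just means re-reading the last committed range from the store and replaying from Kafka, with no HDFS write-ahead log involved.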

I've got an implementation I've been using:

https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka
/src/main/scala/org/apache/spark/rdd/kafka

Tresata has something similar at https://github.com/tresata/spark-kafka,
and I know there were earlier attempts based on Storm code.

Trying to distribute these kinds of fixes as libraries rather than patches
to Spark is problematic, because large portions of the implementation are
private[spark].

I'd like to help, but I need to know whose attention to get.
