Hey Cody,

Thanks for reaching out with this. The lead on streaming is TD - he's
traveling this week, though, so I'll respond as best I can. On the
high-level question of whether Kafka is important - it definitely is.
Anecdotally, something like 80% of Spark Streaming deployments ingest
data from Kafka. Good support for Kafka is also something we generally
want in Spark itself, not in an external library. IIRC, in some cases
there were user libraries that used unstable Kafka APIs, and we were
waiting for Kafka to stabilize those before merging things upstream -
otherwise users wouldn't be able to use newer Kafka versions. This is
only a high-level impression, though; I haven't talked to TD about
this recently, so it's worth revisiting given the developments in
Kafka.

Please do bring up issues like this on the dev list when they are
blockers for your usage - thanks for pinging it.

- Patrick

On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger <c...@koeninger.org> wrote:
> Now that 1.2 is finalized...  who are the go-to people to get some
> long-standing Kafka related issues resolved?
>
> The existing API is neither sufficiently safe nor flexible for our production
> use.  I don't think we're alone in this viewpoint, because I've seen
> several different patches and libraries to fix the same things we've been
> running into.
>
> Regarding flexibility:
>
> https://issues.apache.org/jira/browse/SPARK-3146
>
> has been outstanding since August, and IMHO an equivalent of this is
> absolutely necessary.  We wrote a similar patch ourselves, then found that
> PR and have been running it in production.  We wouldn't be able to get our
> jobs done without it.  It also allows users to solve a whole class of
> problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc).
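>
> To sketch the kind of API I have in mind (names and signatures here
> are illustrative only, not the actual PR's):
>
>   import org.apache.spark.SparkContext
>   import org.apache.spark.rdd.RDD
>
>   // One Spark partition per Kafka (topic, partition, offset range),
>   // so a batch is fully determined by ranges the *user* chooses.
>   case class OffsetRange(topic: String, partition: Int,
>                          fromOffset: Long, untilOffset: Long)
>
>   // Stubbed out for the sketch; the real thing would fetch exactly
>   // these offsets from the brokers.
>   def kafkaRDD(sc: SparkContext,
>                kafkaParams: Map[String, String],
>                ranges: Seq[OffsetRange]): RDD[(String, String)] = ???
>
>   // Re-running with the same ranges reads exactly the same messages,
>   // which is what makes replay-on-failure and arbitrary delay of
>   // messages possible.
>   val ranges = Seq(
>     OffsetRange("events", 0, fromOffset = 0L, untilOffset = 100L),
>     OffsetRange("events", 1, fromOffset = 0L, untilOffset = 100L))
>   val batch = kafkaRDD(sc,
>     Map("metadata.broker.list" -> "broker:9092"), ranges)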
>
> Regarding safety, I understand the motivation behind WriteAheadLog as a
> general solution for streaming unreliable sources, but Kafka already is a
> reliable source.  I think there's a need for an api that treats it as
> such.  Even aside from the performance issues of duplicating the
> write-ahead log in Kafka into another write-ahead log in HDFS, I need
> exactly-once semantics in the face of failure (I've had failures that
> prevented reloading a Spark Streaming checkpoint, for instance).
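>
> Concretely, the pattern I mean is to commit results and their offsets
> in the same transaction, so a failure can never leave data written
> without its offsets (or vice versa), and a restart just resumes from
> the stored offsets. A rough sketch (table names illustrative, reusing
> the OffsetRange shape from above):
>
>   import java.sql.Connection
>
>   def saveBatch(db: Connection,
>                 results: Iterator[(String, Long)],
>                 range: OffsetRange): Unit = {
>     db.setAutoCommit(false)
>     try {
>       val upsert = db.prepareStatement(
>         "INSERT INTO counts (k, n) VALUES (?, ?)")
>       results.foreach { case (k, n) =>
>         upsert.setString(1, k); upsert.setLong(2, n); upsert.addBatch()
>       }
>       upsert.executeBatch()
>       // Data and offsets commit or roll back together, so no second
>       // write-ahead log in HDFS is needed.
>       val off = db.prepareStatement(
>         "UPDATE offsets SET until_offset = ? " +
>         "WHERE topic = ? AND part = ? AND until_offset = ?")
>       off.setLong(1, range.untilOffset)
>       off.setString(2, range.topic)
>       off.setInt(3, range.partition)
>       off.setLong(4, range.fromOffset)
>       // The guard on the old offset makes a duplicate replay of the
>       // same range fail cleanly instead of double-counting.
>       if (off.executeUpdate() != 1) sys.error("offsets moved; aborting")
>       db.commit()
>     } catch { case e: Exception => db.rollback(); throw e }
>   }
>
> That only works if the API actually exposes offsets, which is exactly
> why SPARK-3146 matters.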
>
> I've got an implementation I've been using:
>
> https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
>
> Tresata has something similar at https://github.com/tresata/spark-kafka,
> and I know there were earlier attempts based on Storm code.
>
> Trying to distribute these kinds of fixes as libraries rather than patches
> to Spark is problematic, because large portions of the implementation are
> private[spark].
>
> I'd like to help, but I need to know whose attention to get.
