Thank you very much Yan! I am happy to chat more if you need further clarifications.
On Tue, Jun 2, 2015 at 3:54 PM, Yan Fang <[email protected]> wrote:
> -- Hi Jay,
>
> Thanks for forwarding this.
>
> -- Hi TD,
>
> Thanks for pointing this out. That overview was written about a year ago
> and is out of date, so it is good to get some critiques from your side. I
> will update it soon; I created SAMZA-698
> <https://issues.apache.org/jira/browse/SAMZA-698> to track this. Thank you.
>
> Cheers,
>
> Fang, Yan
> [email protected]
>
> On Tue, Jun 2, 2015 at 9:40 AM, Jay Kreps <[email protected]> wrote:
>
>> Hey guys,
>>
>> Here are some critiques of our system comparison page from Tathagata at
>> Databricks.
>>
>> -Jay
>>
>> ---------- Forwarded message ----------
>> From: Tathagata Das <[email protected]>
>> Date: Thu, May 14, 2015 at 1:15 PM
>> Subject: About Spark Streaming overview in Samza docs
>> To: Jay Kreps <[email protected]>
>>
>> Hello Jay,
>>
>> I am not sure if you remember me from our phone conversation a year or so
>> ago, along with Patrick Wendell, so let me introduce myself. I am
>> Tathagata Das (aka TD), the technical lead behind Spark Streaming. We
>> chatted earlier about various topics related to Kafka, and I hope we can
>> chat more about them some time soon.
>>
>> In this mail, however, I wanted to talk a bit about Samza's description
>> of Spark Streaming
>> <http://samza.apache.org/learn/documentation/0.9/comparisons/spark-streaming.html>.
>> I assumed that you are the right person to talk to about this; if that
>> isn't the case, feel free to redirect me to whoever you think is the best
>> person.
>>
>> The overview of Spark Streaming is pretty good! I myself would not have
>> been able to put the high-level architecture of Spark Streaming so
>> succinctly. That said, a few pieces of information are outdated, and it
>> would be good to update the page to avoid confusion. Here are some of
>> them.
>>
>> 1. *"Spark Streaming does not guarantee at-least-once or at-most-once
>> messaging semantics"* - This is outdated information. In Spark 1.2, we
>> introduced write-ahead logs
>> <https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html>
>> that can guarantee at-least-once processing for any reliable source,
>> despite driver and worker failures. In addition, in Spark 1.3 we
>> introduced a new way
>> <https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html>
>> to process data from Kafka that achieves end-to-end exactly-once
>> processing when data-store updates are idempotent or transactional (BTW,
>> did I mention that Kafka is *amazing*? It is what allowed us to take this
>> crazy new approach).
>>
>> 2. *"Spark Streaming may lose data if the failure happens when the data
>> is received but not yet replicated to other nodes (also see SPARK-1647)"*
>> - Again, this changed between Spark 1.1 and 1.3. For Flume, we added a
>> Flume polling stream that uses Flume transactions to guarantee that data
>> is properly replicated or retransmitted on receiver failure; driver
>> failures are handled by the write-ahead logs. For Kafka, the new approach
>> does not even need replication, as it treats Kafka like a file system,
>> reading segments of the log as needed.
>>
>> 3. *"it is unsuitable for nondeterministic processing, e.g. a randomized
>> machine learning algorithm"* - It is incorrect to say that Spark
>> Streaming is unsuitable. We suggest using only deterministic operations
>> so that developers always get the expected results even when there are
>> failures. Just as in MapReduce, nothing stops a user from implementing a
>> non-deterministic algorithm on Spark Streaming, as long as the user is
>> aware of the consequences for the fault-tolerance guarantees (results may
>> change due to failures).
>> Furthermore, randomized streaming machine-learning algorithms can still
>> be implemented using deterministic transformations (seeded pseudo-random
>> numbers, etc.). There are quite a few random-sampling operations (e.g.
>> RDD.sample()
>> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD>)
>> and randomized algorithms
>> <https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html>
>> in core Spark and MLlib (Spark's machine learning library), and the same
>> techniques can be used to implement "deterministic" randomized
>> machine-learning algorithms on Spark Streaming.
>>
>> 4. *"When a driver node fails in Spark Streaming, Spark's standalone
>> cluster mode will restart the driver node automatically. But it is
>> currently not supported in YARN and Mesos."* - YARN supports
>> automatically restarting the ApplicationMaster, with the global default
>> allowing at most one restart (yarn.resourcemanager.am.max-attempts
>> <http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>).
>> On Mesos, applications are often launched using Marathon
>> <https://mesosphere.github.io/marathon/docs/>, which also supports
>> restarting.
>>
>> 5. *"Samza is still young, but has just released version 0.7.0."* -
>> Incorrect ;)
>>
>> Sorry for the long post. I am happy to get on a phone/hangout/skype call
>> with you if more clarifications are needed. And independent of all this,
>> feel free to email me about anything, anytime.
>>
>> Thanks!
>>
>> TD
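To make TD's point 1 concrete: the exactly-once recipe he describes boils down to keying every data-store update by the record's Kafka-style (partition, offset), so that replaying a batch after a failure overwrites rather than double-counts. A minimal plain-Python sketch of that idea (no Spark or Kafka APIs; the store shape and record layout here are illustrative assumptions, not anyone's actual implementation):

```python
# Sketch: exactly-once results via idempotent, offset-keyed updates.
# Each record carries its (partition, offset); the store keys writes by
# that pair, so re-applying a batch after a failure is harmless.

def apply_batch(store, records):
    """Apply (partition, offset, value) records idempotently."""
    for partition, offset, value in records:
        # Keying by (partition, offset) makes the write idempotent:
        # a replayed record overwrites itself with the same value.
        store[(partition, offset)] = value

def total(store):
    return sum(store.values())

store = {}
batch = [(0, 0, 5), (0, 1, 7), (1, 0, 3)]
apply_batch(store, batch)
apply_batch(store, batch)  # replay after a simulated failure
assert total(store) == 15  # no double counting
```

The same effect can be had with a transactional store that commits offsets atomically with the results; idempotent keying is just the simplest variant to show.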
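TD's point 3 trick of "deterministic" randomness can be sketched the same way: derive each partition's RNG seed from a fixed job seed plus the partition id, so recomputing a lost partition reproduces the identical sample. A plain-Python illustration (the function name, seed-mixing scheme, and parameters are made up for this sketch):

```python
import random

def sample_partition(records, partition_id, fraction=0.5, seed=42):
    # Derive the RNG seed deterministically from a fixed job seed and
    # the partition id: recomputation after a failure replays the exact
    # same pseudo-random draws, so the "random" sample is reproducible.
    rng = random.Random(seed * 1_000_003 + partition_id)
    return [r for r in records if rng.random() < fraction]

data = list(range(100))
first = sample_partition(data, partition_id=3)
replay = sample_partition(data, partition_id=3)  # recompute after "failure"
assert first == replay  # identical sample on recomputation
```

Different partitions still draw from effectively independent streams because their seeds differ, which is the same idea behind reproducible sampling in core Spark.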
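For TD's point 4, the YARN knob he cites is a cluster-wide cap set in yarn-site.xml. As a sketch (the value shown is illustrative; the shipped default may differ across Hadoop versions, and individual applications can request a lower limit):

```xml
<!-- yarn-site.xml: cluster-wide cap on ApplicationMaster attempts.
     A value of 2 means the original attempt plus one automatic restart,
     which is the behavior TD describes as the global default. -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>2</value>
</property>
```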
