Re: What do I lose if I run spark without using HDFS or Zookeeper?

Steve Loughran Fri, 26 Aug 2016 02:51:33 -0700

On 25 Aug 2016, at 22:49, kant kodali <kanth...@gmail.com> wrote:

Yeah, so it seems like it's a work in progress. At the very least, Mesos took
the initiative to provide alternatives to ZK. I am really looking forward to
this.

https://issues.apache.org/jira/browse/MESOS-3797

I worry about any attempt to implement distributed consensus systems: they take
time in production to get right.

1. There's the need to prove that what you are building is valid if the
implementation matches the specification. That has apparently been done for ZK,
though given the complexity of the maths involved, I cannot vouch for that
myself:
https://blog.acolyer.org/2015/03/09/zab-high-performance-broadcast-for-primary-backup-systems/

2. You need to run it in production to find the problems. Google's Chubby paper
hints at the things they found went wrong there. As far as ZK goes, Jepsen
hints it's robust:

https://aphyr.com/posts/291-jepsen-zookeeper

If it has weaknesses, I'd point at
- its security model,
- its lack of helpfulness when there are Kerberos/SASL auth problems (ZK
server closes connection; client sees connection failure and retries),
- the fact that its failure modes aren't always understood by people coding
against it.

http://blog.cloudera.com/blog/2014/03/zookeeper-resilience-at-pinterest/
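
To make the auth point concrete, here is a toy simulation of that pattern. It
does not use the real ZooKeeper client or server API; ToyServer,
ConnectionClosed and connect_with_retries are hypothetical names invented for
illustration. It only shows why a client that sees nothing but a closed
connection cannot tell bad credentials from a network blip, and so keeps
retrying.

```python
class ConnectionClosed(Exception):
    """The only error the toy client ever sees."""


class ToyServer:
    """Hypothetical stand-in for a server that drops the connection on bad auth."""

    def __init__(self, valid_token):
        self.valid_token = valid_token
        self.rejected = 0  # server-side knowledge the client never receives

    def connect(self, token):
        if token != self.valid_token:
            self.rejected += 1
            # No "auth failed" reply is sent: the connection is simply closed.
            raise ConnectionClosed("connection closed by server")
        return "session"


def connect_with_retries(server, token, max_attempts=3):
    """Retry on any connection failure; return (session_or_None, attempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return server.connect(token), attempt
        except ConnectionClosed:
            # An auth rejection is indistinguishable from a transient network
            # failure here, so all the client can do is try again.
            continue
    return None, max_attempts


server = ToyServer(valid_token="good")
print(connect_with_retries(server, "good"))  # ('session', 1)
print(connect_with_retries(server, "bad"))   # (None, 3)
```

With the bad token the client burns all its attempts and reports only a
connection failure; the reason stays in the server's counter, which is roughly
where you end up looking (the server logs) when this happens for real.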

The Raft algorithm appears to be easier to implement than Paxos; there are
things built on it, and I look forward to seeing what works and what doesn't
in production.

Certainly Aphyr found problems when pointing Jepsen at etcd, though as that
was a 2014 piece of work, I expect those specific problems to have been
addressed. The main thing is that it shows how hard it is to get things right
in the presence of complex failures.

Finally, regarding S3:

You can use the S3 object store as a source of data in queries/streaming and,
if done carefully, as a destination. Performance is variable; that is
something some of us are working on, across S3A, Spark and Hive.
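
As a rough sketch of the "source of data" case only: this assumes PySpark with
the hadoop-aws (S3A) connector on the classpath and credentials already
configured. The bucket and prefix are made-up placeholders, and s3a_uri and
count_lines are hypothetical helpers, not part of Spark.

```python
def s3a_uri(bucket, key):
    """Build an s3a:// URI from a bucket name and an object key or prefix."""
    return "s3a://{}/{}".format(bucket, key.lstrip("/"))


def count_lines(spark, bucket, prefix):
    """Count text lines under an S3 prefix.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    # spark.read.text() accepts any Hadoop-compatible filesystem URI,
    # so S3 is addressed here through the s3a:// scheme.
    return spark.read.text(s3a_uri(bucket, prefix)).count()


# Usage, given a running session (placeholder names):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("s3a-read").getOrCreate()
#   print(count_lines(spark, "example-bucket", "logs/2016/08/"))
```

This deliberately leaves out writing results back to S3, which is the part
that needs the care mentioned above.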

Conference placement: I shall be talking on that topic at Spark Summit Europe
if you want to find out more: https://spark-summit.org/eu-2016/

On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt <mgumm...@mesosphere.io> wrote:
Mesos also uses ZK for leader election. There seems to be some effort to
support etcd, but it's in progress:
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali <kanth...@gmail.com> wrote:
@Ofir @Sean very good points.

@Mike We don't use Kafka or Hive. I understand that Zookeeper can do many
things, but for our use case all we need it for is high availability. Given the
frustrations of the devops people here in our company, who had extensive
experience managing large clusters in the past, we would be very happy to avoid
Zookeeper. I also heard that Mesos can provide high availability through etcd
and Consul; if that is true, I will be left with the following stack:

Spark + Mesos scheduler + distributed file system (or, to be precise,
distributed storage, since S3 is an object store; I guess this will be HDFS
for us) + etcd & Consul. Now the big question for me is how do I set all this
up :)
