Mesos also uses ZK for leader election. There is some effort toward supporting etcd, but it's still in progress: https://issues.apache.org/jira/browse/MESOS-1806
On Thu, Aug 25, 2016 at 1:55 PM, kant kodali <kanth...@gmail.com> wrote:

> @Ofir @Sean very good points.
>
> @Mike We don't use Kafka or Hive, and I understand that Zookeeper can do
> many things, but for our use case all we need is high availability. Given
> the frustrations of the DevOps people here in our company, who have
> extensive experience managing large clusters, we would be very happy to
> avoid Zookeeper. I also heard that Mesos can provide high availability
> through etcd and consul, and if that is true I will be left with the
> following stack:
>
> Spark + Mesos scheduler + a distributed file system (or, to be precise,
> distributed storage, since S3 is an object store), which I guess will be
> HDFS for us, plus etcd & consul. Now the big question for me is how do I
> set all this up.
>
> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>
>> Just to add one concrete example regarding the HDFS dependency, have a
>> look at checkpointing:
>> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>> For example, for Spark Streaming, you cannot do any window operation in
>> a cluster without checkpointing to HDFS (or S3).
>>
>> Ofir Manor
>> Co-Founder & CTO | Equalum
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
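To make Ofir's point concrete, here is a minimal sketch (against the 1.6-era streaming API) of a windowed word count. The checkpoint path hdfs://namenode:8020/checkpoints and the socket source are placeholders; the point is that reduceByKeyAndWindow with an inverse reduce function will not run until a checkpoint directory on fault-tolerant storage is set.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WindowedCount")
        val ssc = new StreamingContext(conf, Seconds(10))
        // Window operations require a checkpoint directory on
        // fault-tolerant storage (HDFS or S3); a local path will not
        // survive a driver failure. The address below is a placeholder.
        ssc.checkpoint("hdfs://namenode:8020/checkpoints")
        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        val counts = words.map((_, 1))
          .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }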
>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>> Hi Kant,
>>
>> I trust the following will be of use.
>>
>> Big Data depends on the Hadoop ecosystem from whichever angle one looks
>> at it.
>>
>> At the heart of it, and with reference to the points you raised about
>> HDFS, one needs a working knowledge of the Hadoop core system, including
>> HDFS, the MapReduce algorithm and YARN, whether one uses them or not.
>> After all, Big Data is all about horizontal scaling with a master and
>> nodes (as opposed to vertical scaling, like SQL Server running on a
>> single host) and about distributed data (by default, data is replicated
>> three times on different nodes for scalability and availability).
>>
>> Other members, including Sean, described the limits on how far one can
>> operate Spark in its own space. If you are going to deal with data (data
>> in motion and data at rest), then you will need to interact with some
>> form of storage, and HDFS and compatible file systems like S3 are the
>> natural choices.
>>
>> Zookeeper is not just about high availability. It is used in Spark
>> Streaming with Kafka, and it is used with Hive for concurrency. It is
>> also a distributed locking system.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> On 25 August 2016 at 20:52, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>> s/paying a role/playing a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>> One way you can start to make this make more sense, Sean, is if you
>> exploit the code/data duality so that the non-distributed data you are
>> sending out from the driver is actually paying a role more like code (or
>> at least parameters). What is sent from the driver to an Executor is then
>> used (typically as seeds or parameters) to execute some procedure on the
>> Worker node that generates the actual data on the Workers. After that,
>> you proceed to execute in a more typical fashion with Spark, using the
>> now-instantiated distributed data.
>>
>> But I don't get the sense that this meta-programming-ish style is really
>> what the OP was aiming at.
>>
>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Without a distributed storage system, your application can only create
>> data on the driver and send it out to the workers, and collect data back
>> from the workers. You can't read or write data in a distributed way.
>> There are use cases for this, but they are pretty limited (unless you're
>> running on one machine).
>>
>> I can't really imagine a serious use of (distributed) Spark without
>> (distributed) storage, in the same way that I don't think many apps
>> exist that don't read/write data.
>>
>> The premise here is not just replication, but partitioning data across
>> compute resources. With a distributed file system, your big input exists
>> across a bunch of machines and you can send the work to the pieces of
>> data.
>>
>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <kanth...@gmail.com> wrote:
>>
>> @Mich I understand why I would need Zookeeper. It is there for fault
>> tolerance: given that Spark is a master-slave architecture, when a
>> master goes down Zookeeper will run a leader election algorithm to elect
>> a new leader. However, our DevOps people hate Zookeeper; they would be
>> much happier to go with etcd & consul, and it looks like if we use the
>> Mesos scheduler we should be able to drop Zookeeper.
>>
>> I am still trying to understand why I would need HDFS for Spark. I
>> understand the purpose of distributed file systems in general, but not
>> in the context of Spark, since many people say you can run a distributed
>> Spark cluster in standalone mode, and I am not sure what the pros/cons
>> are of doing it that way. In the Hadoop world I understand that one of
>> the reasons HDFS exists is replication; in other words, if we write some
>> data to HDFS it will store each block on different nodes, so that if one
>> of the nodes goes down it can still retrieve that block from the other
>> nodes. In the context of Spark I am not really sure, because 1) I am new
>> and 2) the Spark paper says it doesn't replicate data; instead it stores
>> the lineage (all the transformations) so that it can reconstruct the
>> data.
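A minimal sketch of the code/data-duality pattern Mark describes above, assuming nothing beyond core Spark; the seed count and per-seed data volume are arbitrary.

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.Random

    object GenerateOnWorkers {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GenerateOnWorkers"))
        // The only payload shipped from the driver: one integer seed per partition.
        val seeds = sc.parallelize(0 until 100, 100)
        // Each worker expands its seed into its share of the actual data.
        val data = seeds.flatMap { seed =>
          val rng = new Random(seed)
          Iterator.fill(100000)(rng.nextGaussian())
        }
        // From here on, proceed as with ordinary distributed data.
        println(s"generated ${data.count()} points")
        sc.stop()
      }
    }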
>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
>> wrote:
>>
>> You can use Spark on Oracle as a query tool.
>>
>> It all depends on the mode of operation.
>>
>> If you are running Spark in yarn-client/cluster mode then you will need
>> YARN. It comes as part of Hadoop core (HDFS, MapReduce and YARN).
>>
>> I have not gone and installed YARN without installing Hadoop.
>>
>> What is the overriding reason to have Spark on its own?
>>
>> You can use Spark in Local or Standalone mode if you do not want Hadoop
>> core.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> On 24 August 2016 at 21:54, kant kodali <kanth...@gmail.com> wrote:
>>
>> What do I lose if I run Spark without using HDFS or Zookeeper? Which of
>> them is almost a must in practice?

--
Michael Gummelt
Software Engineer
Mesosphere
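For reference, a minimal sketch of Mich's point that Spark can run with no Hadoop core at all: a local master and local-filesystem input (the path is a placeholder).

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("LocalWordCount")
          .setMaster("local[*]")  // all local cores; no cluster manager, no YARN
        val sc = new SparkContext(conf)
        // Read from the local filesystem; no HDFS involved. Placeholder path.
        val lines = sc.textFile("file:///tmp/input.txt")
        println(lines.flatMap(_.split(" ")).count())
        sc.stop()
      }
    }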