Mesos also uses ZK for leader election. There is some effort toward supporting etcd, but it's still in progress: https://issues.apache.org/jira/browse/MESOS-1806
On Thu, Aug 25, 2016 at 1:55 PM, kant kodali <kanth...@gmail.com> wrote:

> @Ofir @Sean very good points.
>
> @Mike We don't use Kafka or Hive, and I understand that Zookeeper can do
> many things, but for our use case all we need is high availability. Given
> the frustrations of the DevOps people here in our company, who have
> extensive experience managing large clusters, we would be very happy to
> avoid Zookeeper. I also heard that Mesos can provide high availability
> through etcd and consul, and if that is true I will be left with the
> following stack:
>
> Spark + Mesos scheduler + a distributed file system (or, to be precise,
> distributed storage, since S3 is an object store), which I guess will be
> HDFS for us, plus etcd & consul. Now the big question for me is how do I
> set all this up.
>
> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>
>> Just to add one concrete example regarding the HDFS dependency, have a
>> look at checkpointing:
>> https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>> For example, for Spark Streaming, you cannot do any window operation in
>> a cluster without checkpointing to HDFS (or S3).
>>
>> Ofir Manor
>> Co-Founder & CTO | Equalum
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
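To make Ofir's point concrete, here is a minimal sketch (against the 1.6-era streaming API) of a windowed word count. The checkpoint path hdfs://namenode:8020/checkpoints and the socket source are placeholders; the point is that reduceByKeyAndWindow with an inverse reduce function will not run until a checkpoint directory on fault-tolerant storage is set.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WindowedCount")
        val ssc = new StreamingContext(conf, Seconds(10))
        // Window operations require a checkpoint directory on
        // fault-tolerant storage (HDFS or S3); a local path will not
        // survive a driver failure. The address below is a placeholder.
        ssc.checkpoint("hdfs://namenode:8020/checkpoints")
        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        val counts = words.map((_, 1))
          .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }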
>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>> Hi Kant,
>>
>> I trust the following will be of use.
>>
>> Big Data depends on the Hadoop ecosystem from whichever angle one looks
>> at it.
>>
>> At the heart of it, and with reference to the points you raised about
>> HDFS, one needs a working knowledge of the Hadoop core system, including
>> HDFS, the MapReduce algorithm and YARN, whether one uses them or not.
>> After all, Big Data is all about horizontal scaling with a master and
>> nodes (as opposed to vertical scaling, like SQL Server running on a
>> single host) and about distributed data (by default, data is replicated
>> three times on different nodes for scalability and availability).
>>
>> Other members, including Sean, described the limits on how far one can
>> operate Spark in its own space. If you are going to deal with data (data
>> in motion and data at rest), then you will need to interact with some
>> form of storage, and HDFS and compatible file systems like S3 are the
>> natural choices.
>>
>> Zookeeper is not just about high availability. It is used in Spark
>> Streaming with Kafka, and it is used with Hive for concurrency. It is
>> also a distributed locking system.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> On 25 August 2016 at 20:52, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>> s/paying a role/playing a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>> One way you can start to make this make more sense, Sean, is if you
>> exploit the code/data duality so that the non-distributed data you are
>> sending out from the driver is actually paying a role more like code (or
>> at least parameters). What is sent from the driver to an Executor is then
>> used (typically as seeds or parameters) to execute some procedure on the
>> Worker node that generates the actual data on the Workers. After that,
>> you proceed to execute in a more typical fashion with Spark, using the
>> now-instantiated distributed data.
>>
>> But I don't get the sense that this meta-programming-ish style is really
>> what the OP was aiming at.
>>
>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Without a distributed storage system, your application can only create
>> data on the driver and send it out to the workers, and collect data back
>> from the workers. You can't read or write data in a distributed way.
>> There are use cases for this, but they are pretty limited (unless you're
>> running on one machine).
>>
>> I can't really imagine a serious use of (distributed) Spark without
>> (distributed) storage, in the same way that I don't think many apps
>> exist that don't read/write data.
>>
>> The premise here is not just replication, but partitioning data across
>> compute resources. With a distributed file system, your big input exists
>> across a bunch of machines and you can send the work to the pieces of
>> data.
>>
>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <kanth...@gmail.com> wrote:
>>
>> @Mich I understand why I would need Zookeeper. It is there for fault
>> tolerance: given that Spark is a master-slave architecture, when a
>> master goes down Zookeeper will run a leader election algorithm to elect
>> a new leader. However, our DevOps people hate Zookeeper; they would be
>> much happier to go with etcd & consul, and it looks like if we use the
>> Mesos scheduler we should be able to drop Zookeeper.
>>
>> I am still trying to understand why I would need HDFS for Spark. I
>> understand the purpose of distributed file systems in general, but not
>> in the context of Spark, since many people say you can run a distributed
>> Spark cluster in standalone mode, and I am not sure what the pros/cons
>> are of doing it that way. In the Hadoop world I understand that one of
>> the reasons HDFS exists is replication; in other words, if we write some
>> data to HDFS it will store each block on different nodes, so that if one
>> of the nodes goes down it can still retrieve that block from the other
>> nodes. In the context of Spark I am not really sure, because 1) I am new
>> and 2) the Spark paper says it doesn't replicate data; instead it stores
>> the lineage (all the transformations) so that it can reconstruct the
>> data.
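A minimal sketch of the code/data-duality pattern Mark describes above, assuming nothing beyond core Spark; the seed count and per-seed data volume are arbitrary.

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.Random

    object GenerateOnWorkers {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GenerateOnWorkers"))
        // The only payload shipped from the driver: one integer seed per partition.
        val seeds = sc.parallelize(0 until 100, 100)
        // Each worker expands its seed into its share of the actual data.
        val data = seeds.flatMap { seed =>
          val rng = new Random(seed)
          Iterator.fill(100000)(rng.nextGaussian())
        }
        // From here on, proceed as with ordinary distributed data.
        println(s"generated ${data.count()} points")
        sc.stop()
      }
    }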
>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
>> wrote:
>>
>> You can use Spark on Oracle as a query tool.
>>
>> It all depends on the mode of operation.
>>
>> If you are running Spark in yarn-client/cluster mode then you will need
>> YARN. It comes as part of Hadoop core (HDFS, MapReduce and YARN).
>>
>> I have not gone and installed YARN without installing Hadoop.
>>
>> What is the overriding reason to have Spark on its own?
>>
>> You can use Spark in Local or Standalone mode if you do not want Hadoop
>> core.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> On 24 August 2016 at 21:54, kant kodali <kanth...@gmail.com> wrote:
>>
>> What do I lose if I run Spark without using HDFS or Zookeeper? Which of
>> them is almost a must in practice?

--
Michael Gummelt
Software Engineer
Mesosphere
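For reference, a minimal sketch of Mich's point that Spark can run with no Hadoop core at all: a local master and local-filesystem input (the path is a placeholder).

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("LocalWordCount")
          .setMaster("local[*]")  // all local cores; no cluster manager, no YARN
        val sc = new SparkContext(conf)
        // Read from the local filesystem; no HDFS involved. Placeholder path.
        val lines = sc.textFile("file:///tmp/input.txt")
        println(lines.flatMap(_.split(" ")).count())
        sc.stop()
      }
    }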