Re: What do I loose if I run spark without using HDFS or Zookeeper?

Carlile, Ken Fri, 26 Aug 2016 05:53:33 -0700

We use Spark with NFS as the data store, mainly using Dr. Jeremy Freeman’s Thunder framework. Works very well (and I see HUGE throughput on the storage system during loads). I haven’t seen (or heard from the devs/users) a need for HDFS or S3.

—Ken

On Aug 25, 2016, at 8:02 PM, kant kodali <kanth...@gmail.com> wrote:

The ZFS Linux port has become very stable these days, given that LLNL maintains it and also uses it as the file system for their supercomputer (one of the top supercomputers in the nation, from what I have heard).

On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote:

How about using ZFS?

On Thu, Aug 25, 2016 3:48 PM, Mark Hamstra m...@clearstorydata.com wrote:

That's often not as important as you might think. It really only affects the loading of data by the first Stage. Subsequent Stages (in the same Job or even in other Jobs if you do it right) will use the map outputs, and will do so with good data locality.

On Thu, Aug 25, 2016 at 3:36 PM, ayan guha <guha.a...@gmail.com> wrote:

At its core, MapReduce relies heavily on data locality. You would lose the ability to process data closest to where it resides if you do not use HDFS. S3 or NFS will not be able to provide that.

On 26 Aug 2016 07:49, "kant kodali" <kanth...@gmail.com> wrote:

Yeah, so it seems like it's a work in progress. At the very least, Mesos took the initiative to provide alternatives to ZK. I am really looking forward to this.

https://issues.apache.org/jira/browse/MESOS-3797

On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:

Mesos also uses ZK for leader election. There seems to be some effort in supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali <kanth...@gmail.com> wrote:

@Ofir @Sean very good points.

@Mike We don't use Kafka or Hive. I understand that Zookeeper can do many things, but for our use case all we need it for is high availability. Given the frustrations of the devops people here at our company, who have extensive experience managing large clusters, we would be very happy to avoid Zookeeper. I have also heard that Mesos can provide high availability through etcd and consul, and if that is true I will be left with the following stack:

Spark + Mesos scheduler + distributed storage + etcd & consul. (I say distributed storage rather than distributed file system because S3 is an object store; for us I guess this will be HDFS.) Now the big question for me is how to set all this up.

On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:

Just to add one concrete example regarding HDFS dependency.
Have a look at checkpointing https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing

For example, in Spark Streaming you cannot do any window operation in a cluster without checkpointing to HDFS (or S3).

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi Kant,

I trust the following would be of use.

Big Data depends on the Hadoop ecosystem from whichever angle one looks at it.

At the heart of it, and with reference to the points you raised about HDFS, one needs a working knowledge of the Hadoop core system, including HDFS, the MapReduce algorithm and YARN, whether one uses them or not. After all, Big Data is all about horizontal scaling with a master and nodes (as opposed to vertical scaling, like SQL Server running on a single host) and distributed data (by default, data is replicated three times on different nodes for scalability and availability).

Other members, including Sean, have described the limits on how far one can operate Spark in its own space. If you are going to deal with data (data in motion and data at rest), then you will need to interact with some form of storage, and HDFS and compatible file systems like S3 are the natural choices.

Zookeeper is not just about high availability. It is used in Spark Streaming with Kafka, it is also used with Hive for concurrency. It is also a distributed locking system.

HTH

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 25 August 2016 at 20:52, Mark Hamstra <m...@clearstorydata.com> wrote:

s/paying a role/playing a role/

On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

One way you can start to make this make more sense, Sean, is if you exploit the code/data duality so that the non-distributed data that you are sending out from the driver is actually paying a role more like code (or at least parameters). What is sent from the driver to an Executor is then used (typically as seeds or parameters) to execute some procedure on the Worker node that generates the actual data on the Workers. After that, you proceed to execute in a more typical fashion with Spark using the now-instantiated distributed data.

But I don't get the sense that this meta-programming-ish style is really what the OP was aiming at.
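A minimal sketch of the seeds-as-parameters idea described above, written in plain Python rather than against the Spark API. The driver holds only a short list of seeds, and each simulated worker expands its seed into data locally. The names `generate_partition` and `run_job` are hypothetical, and the workers are simulated with ordinary function calls.

```python
import random

def generate_partition(seed, size):
    """Runs on a (simulated) worker: expand a small seed into local data."""
    rng = random.Random(seed)          # deterministic generator per seed
    return [rng.random() for _ in range(size)]

def run_job(seeds, size):
    """Driver side: ship only the seeds, then collect one result per partition."""
    partial_sums = []
    for seed in seeds:                 # stands in for one task per seed
        partition = generate_partition(seed, size)
        partial_sums.append(sum(partition))
    return partial_sums

seeds = [1, 2, 3, 4]                   # the only data that leaves the driver
sums = run_job(seeds, size=1000)
```

In Spark terms this corresponds roughly to parallelizing the seeds and mapping a generator function over them, so nothing is read from distributed storage.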

On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:

Without a distributed storage system, your application can only create data on the driver and send it out to the workers, and collect data back from the workers. You can't read or write data in a distributed way. There are use cases for this, but pretty limited (unless you're running on 1 machine).

I can't really imagine a serious use of (distributed) Spark without (distributed) storage, in the same way that I don't think many apps exist that don't read/write data.

The premise here is not just replication, but partitioning data across compute resources. With a distributed file system, your big input exists across a bunch of machines and you can send the work to the pieces of data.
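To illustrate "send the work to the pieces of data", here is a toy sketch in plain Python, not Spark's actual scheduler. The block-to-host map and the `assign_tasks` helper are hypothetical. Each task is placed on a host that already holds a replica of its block, preferring the least-loaded one.

```python
from collections import defaultdict

# Hypothetical block placement: block id -> hosts that hold a replica.
block_locations = {
    "block-0": ["host-a", "host-b"],
    "block-1": ["host-b", "host-c"],
    "block-2": ["host-c", "host-a"],
}

def assign_tasks(locations):
    """Assign one task per block to the least-loaded host holding that block."""
    load = defaultdict(int)            # host -> number of tasks assigned so far
    assignment = {}
    for block, hosts in sorted(locations.items()):
        # Pick the replica holder with the fewest tasks; break ties by name.
        host = min(hosts, key=lambda h: (load[h], h))
        assignment[block] = host
        load[host] += 1
    return assignment

assignment = assign_tasks(block_locations)
```

With storage that exposes no block locations (for example a shared NFS mount), every host is an equally good candidate, so this placement step has nothing to work with for the initial read.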

On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <kanth...@gmail.com> wrote:

@Mich I understand why I would need Zookeeper. It is there for fault tolerance: Spark has a master-slave architecture, and when a master goes down Zookeeper runs a leader election to elect a new leader. However, our DevOps people hate Zookeeper and would be much happier with etcd & consul, and it looks like if we use the Mesos scheduler we should be able to drop Zookeeper.
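For reference, a hedged sketch of how master recovery is configured for Spark's own standalone cluster manager (this does not apply to Mesos). The property names `spark.deploy.recoveryMode`, `spark.deploy.zookeeper.url` and `spark.deploy.recoveryDirectory` come from the Spark standalone documentation and are normally passed through `SPARK_DAEMON_JAVA_OPTS`. The helper `daemon_java_opts`, the host names and the directory are illustrative assumptions.

```python
def daemon_java_opts(mode, zk_url=None, recovery_dir=None):
    """Build a SPARK_DAEMON_JAVA_OPTS value for a standalone master."""
    opts = ["-Dspark.deploy.recoveryMode=" + mode]
    if mode == "ZOOKEEPER":
        # Standby masters with leader election through ZooKeeper.
        if not zk_url:
            raise ValueError("ZOOKEEPER mode needs zk_url")
        opts.append("-Dspark.deploy.zookeeper.url=" + zk_url)
    elif mode == "FILESYSTEM":
        # Single master that recovers its state from a directory on restart.
        if not recovery_dir:
            raise ValueError("FILESYSTEM mode needs recovery_dir")
        opts.append("-Dspark.deploy.recoveryDirectory=" + recovery_dir)
    elif mode != "NONE":
        raise ValueError("unknown recovery mode: " + mode)
    return " ".join(opts)

# Placeholder hosts and path, for illustration only.
zk_opts = daemon_java_opts("ZOOKEEPER", zk_url="zk1:2181,zk2:2181")
fs_opts = daemon_java_opts("FILESYSTEM", recovery_dir="/var/spark/recovery")
```

The `FILESYSTEM` mode avoids ZooKeeper but provides no standby master or leader election; something external still has to restart the master process.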

As for HDFS, I am still trying to understand why I would need it for Spark. I understand the purpose of distributed file systems in general, but not in the context of Spark: many people say you can run a distributed Spark cluster in standalone mode, and I am not sure what the pros and cons are if we do it that way. In the Hadoop world, I understand that one of the reasons HDFS is there is replication. In other words, if we write some data to HDFS, it stores each block on several nodes, so that if one node goes down the block can still be retrieved from the others. In the context of Spark I am not really sure, because 1) I am new, and 2) the Spark paper says it doesn't replicate data; instead it stores the lineage (all the transformations) so that it can reconstruct the data.
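The lineage idea mentioned above can be sketched with a toy class in plain Python. This is not Spark's RDD implementation; `ToyRDD` and its methods are hypothetical names. Each derived dataset remembers its parent and the transformation applied, so a lost partition is rebuilt by replaying those steps instead of being fetched from a replica.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions are derived, not replicated."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # base partitions (root dataset only)
        self.parent = parent      # lineage: the dataset this one derives from
        self.fn = fn              # lineage: the per-element transformation
        self.cache = {}           # partition index -> materialized list

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        """Return partition `index`, rebuilding it from lineage if missing."""
        if index not in self.cache:
            if self.parent is None:
                self.cache[index] = list(self.source[index])
            else:
                upstream = self.parent.compute(index)
                self.cache[index] = [self.fn(x) for x in upstream]
        return self.cache[index]

base = ToyRDD(source=[[1, 2], [3, 4]])
derived = base.map(lambda x: x * x).map(lambda x: x + 1)

before = derived.compute(1)       # materialize partition 1
derived.cache.pop(1)              # simulate losing that partition
after = derived.compute(1)        # rebuilt by replaying the transformation
```

Note that the replay bottoms out at the root dataset, so the original input still has to be readable from somewhere durable. Lineage covers derived data; it does not replace storage for the source.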

On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com wrote:

You can use Spark on Oracle as a query tool.

It all depends on the mode of the operation.

If you are running Spark in yarn-client or yarn-cluster mode, then you will need YARN. It comes as part of Hadoop core (HDFS, MapReduce and YARN).

I have not gone and installed Yarn without installing Hadoop.

What is the overriding reason to have the Spark on its own?

You can use Spark in Local or Standalone mode if you do not want Hadoop core.

HTH

Dr Mich Talebzadeh


On 24 August 2016 at 21:54, kant kodali <kanth...@gmail.com> wrote:

What do I lose if I run Spark without using HDFS or Zookeeper? Which of them is almost a must in practice?

--

Michael Gummelt

Software Engineer

Mesosphere
