s/paying a role/playing a role/

On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:
> One way you can start to make this make more sense, Sean, is if you
> exploit the code/data duality so that the non-distributed data that you
> are sending out from the driver is actually paying a role more like code
> (or at least parameters). What is sent from the driver to an Executor is
> then used (typically as seeds or parameters) to execute some procedure
> on the Worker node that generates the actual data on the Workers. After
> that, you proceed to execute in a more typical fashion with Spark using
> the now-instantiated distributed data.
>
> But I don't get the sense that this meta-programming-ish style is really
> what the OP was aiming at.
>
> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Without a distributed storage system, your application can only create
>> data on the driver and send it out to the workers, and collect data
>> back from the workers. You can't read or write data in a distributed
>> way. There are use cases for this, but they are pretty limited (unless
>> you're running on one machine).
>>
>> I can't really imagine a serious use of (distributed) Spark without
>> (distributed) storage, in the same way that I don't think many apps
>> exist that don't read/write data.
>>
>> The premise here is not just replication, but partitioning data across
>> compute resources. With a distributed file system, your big input
>> exists across a bunch of machines and you can send the work to the
>> pieces of data.
>>
>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <kanth...@gmail.com>
>> wrote:
>>
>>> @Mich I understand why I would need Zookeeper. It is there for fault
>>> tolerance, given that Spark is a master-slave architecture: when a
>>> master goes down, Zookeeper will run a leader election algorithm to
>>> elect a new leader. However, DevOps hate Zookeeper; they would be much
>>> happier to go with etcd & consul, and it looks like if we use the
>>> Mesos scheduler we should be able to drop Zookeeper.
>>>
>>> I am still trying to understand why I would need HDFS for Spark. I
>>> understand the purpose of distributed file systems in general, but I
>>> don't understand it in the context of Spark, since many people say you
>>> can run a distributed Spark cluster in standalone mode, but I am not
>>> sure what the pros/cons are if we do it that way. In the Hadoop world
>>> I understand that one of the reasons HDFS is there is replication; in
>>> other words, if we write some data to HDFS it will store each block
>>> across different nodes, such that if one of the nodes goes down it can
>>> still retrieve that block from the other nodes. In the context of
>>> Spark I am not really sure, because 1) I am new and 2) the Spark paper
>>> says it doesn't replicate data; instead it stores the lineage (all the
>>> transformations) such that it can reconstruct the data.
>>>
>>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh
>>> mich.talebza...@gmail.com wrote:
>>>
>>>> You can use Spark on Oracle as a query tool.
>>>>
>>>> It all depends on the mode of the operation.
>>>>
>>>> If you are running Spark with yarn-client/cluster then you will need
>>>> YARN. It comes as part of Hadoop core (HDFS, MapReduce and YARN).
>>>>
>>>> I have not gone and installed YARN without installing Hadoop.
>>>>
>>>> What is the overriding reason to have Spark on its own?
>>>>
>>>> You can use Spark in Local or Standalone mode if you do not want
>>>> Hadoop core.
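
For concreteness, a minimal sketch of the two Hadoop-free modes described
above; the app name and the standalone master URL are hypothetical
placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Local mode: driver and executors run inside a single JVM; no
    // cluster manager and no Hadoop components involved at all.
    val localConf = new SparkConf()
      .setAppName("demo")
      .setMaster("local[*]")

    // Standalone mode: Spark's own built-in cluster manager; still no
    // YARN or HDFS required. The master URL points at the standalone
    // master process ("master-host" is made up here).
    val standaloneConf = new SparkConf()
      .setAppName("demo")
      .setMaster("spark://master-host:7077")

    val sc = new SparkContext(localConf)

Either master setting can equally be passed on the command line via
spark-submit's --master flag instead of being hard-coded.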
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn:
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which
>>>> may arise from relying on this email's technical content is
>>>> explicitly disclaimed. The author will in no case be liable for any
>>>> monetary damages arising from such loss, damage or destruction.
>>>>
>>>> On 24 August 2016 at 21:54, kant kodali <kanth...@gmail.com> wrote:
>>>>
>>>> What do I lose if I run Spark without using HDFS or Zookeeper? Which
>>>> of them is almost a must in practice?
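
Mark's code/data-duality suggestion can be sketched roughly as below; the
seed count and the random-number generator are made-up placeholders. The
driver ships only a handful of small seeds, and each worker instantiates
its partition of the actual data locally. It also touches kant's lineage
question: a lost partition is recomputed from the lineage (seed ->
flatMap), not fetched from a replica.

    import org.apache.spark.{SparkConf, SparkContext}

    object SeededGeneration {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("seeded-generation").setMaster("local[*]"))

        // Only these tiny seed values travel from the driver to the
        // executors; no large dataset is ever serialized on the driver.
        val seeds = sc.parallelize(0L until 8L, numSlices = 8)

        // Each worker expands its seed into the real data in place.
        val data = seeds.flatMap { seed =>
          val rng = new scala.util.Random(seed)
          Iterator.fill(100000)(rng.nextDouble())
        }

        // From here on it is ordinary distributed computation over the
        // now-instantiated distributed data.
        println(data.sum() / data.count())

        sc.stop()
      }
    }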