Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-27 Thread kant kodali
I understand now that I cannot use Spark Streaming window operations without checkpointing to HDFS, as pointed out by @Ofir, but without window operations I don't think we can do much with Spark Streaming. Since it is essential, can I use Cassandra as distributed storage? If so, can I see

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Steve Loughran
On 26 Aug 2016, at 12:58, kant kodali wrote: @Steve your arguments make sense; however, a good majority of people who have extensive experience with ZooKeeper prefer to avoid it, and given the ease of Consul (which btw uses Raft for

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali
@Mich of course, and in my previous message I gave some context as well. Needless to say, the tools used by many banks that I came across, such as Citi, Capital One, Wells Fargo, GSachs, are pretty laughable when it comes to compliance and security. They somehow think they are secure when

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Carlile, Ken
We use Spark with NFS as the data store, mainly using Dr. Jeremy Freeman’s Thunder framework. Works very well (and I see HUGE throughput on the storage system during loads). I haven’t seen (or heard from the devs/users) a need for HDFS or S3. —Ken On Aug 25, 2016, at 8:02 PM,

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Mich Talebzadeh
> And yes any technology needs time for maturity but that said it shouldn't stop us from transitioning. Depends on the application and how mission critical the business it is deployed for. If you are using a tool for a Bank's Credit Risk (Surveillance, Anti-Money Laundering, Employee

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali
@Steve your arguments make sense; however, a good majority of people who have extensive experience with ZooKeeper prefer to avoid it, and given the ease of Consul (which btw uses Raft for the election) and etcd, a lot of us are more inclined to avoid ZK. And yes any technology needs

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Steve Loughran
On 25 Aug 2016, at 22:49, kant kodali wrote: yeah so it seems like it's a work in progress. At the very least Mesos took the initiative to provide alternatives to ZK. I am just really looking forward to this.

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
The ZFS Linux port has become very stable these days, given that LLNL maintains it and also uses it as the file system for their supercomputer (one of the top in the nation, from what I heard). On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote: How about

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
How about using ZFS? On Thu, Aug 25, 2016 3:48 PM, Mark Hamstra m...@clearstorydata.com wrote: That's often not as important as you might think. It really only affects the loading of data by the first Stage. Subsequent Stages (in the same Job or even in other Jobs if you do it right) will

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
That's often not as important as you might think. It really only affects the loading of data by the first Stage. Subsequent Stages (in the same Job or even in other Jobs if you do it right) will use the map outputs, and will do so with good data locality. On Thu, Aug 25, 2016 at 3:36 PM, ayan

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
> You would lose the ability to process data closest to where it resides if you do not use hdfs. This isn't true. Many other data sources (e.g. Cassandra) support locality. On Thu, Aug 25, 2016 at 3:36 PM, ayan guha wrote: > At the core of it map reduce relies heavily on

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread ayan guha
At the core of it map reduce relies heavily on data locality. You would lose the ability to process data closest to where it resides if you do not use hdfs. S3 or NFS will not be able to provide that. On 26 Aug 2016 07:49, "kant kodali" wrote: > yeah so it seems like it's a work

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
yeah so it seems like it's a work in progress. At the very least Mesos took the initiative to provide alternatives to ZK. I am just really looking forward to this. https://issues.apache.org/jira/browse/MESOS-3797 On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote: Mesos

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
Mesos also uses ZK for leader election. There seems to be some effort in supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806 On Thu, Aug 25, 2016 at 1:55 PM, kant kodali wrote: > @Ofir @Sean very good points. > > @Mike We don't use Kafka

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
@Ofir @Sean very good points. @Mike We don't use Kafka or Hive, and I understand that ZooKeeper can do many things, but for our use case all we need it for is high availability, and given the frustrations of the DevOps people here in our company, who have had extensive experience managing large clusters in the past

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Ofir Manor
Just to add one concrete example regarding the HDFS dependency. Have a look at checkpointing https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing For example, for Spark Streaming, you cannot do any window operation in a cluster without checkpointing to HDFS (or S3).
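
To illustrate why windowed streaming state needs durable storage, here is a minimal plain-Python sketch. It is not Spark's implementation: the class name WindowedSum, the JSON file layout, and the checkpoint path are all hypothetical. It keeps a sliding-window sum incrementally (adding the batch that enters the window and subtracting the one that leaves), so the current value depends on earlier batches and has to be saved somewhere a replacement process can read it.

```python
import json
import os
from collections import deque


class WindowedSum:
    """Sliding-window sum updated incrementally, one batch at a time (hypothetical sketch)."""

    def __init__(self, window_batches, checkpoint_path):
        self.window_batches = window_batches
        self.checkpoint_path = checkpoint_path
        self.batches = deque()  # per-batch totals still inside the window
        self.total = 0

    def add_batch(self, values):
        batch_total = sum(values)
        self.batches.append(batch_total)
        self.total += batch_total  # fold the new batch into the window
        if len(self.batches) > self.window_batches:
            self.total -= self.batches.popleft()  # remove the batch that left the window
        self._checkpoint()
        return self.total

    def _checkpoint(self):
        # Write to a temporary file and rename it, so a crash mid-write
        # does not leave a truncated checkpoint behind.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"batches": list(self.batches), "total": self.total}, f)
        os.replace(tmp_path, self.checkpoint_path)

    @classmethod
    def recover(cls, window_batches, checkpoint_path):
        # Rebuild the window state from the last checkpoint after a restart.
        window = cls(window_batches, checkpoint_path)
        with open(checkpoint_path) as f:
            state = json.load(f)
        window.batches = deque(state["batches"])
        window.total = state["total"]
        return window
```

In a cluster the checkpoint path would have to sit on storage that every node can reach, which is where HDFS, S3, or a shared mount comes in; a path on one machine's local disk would be lost along with that machine.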

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
Hi Kant, I trust the following will be of use. Big Data depends on the Hadoop ecosystem from whichever angle one looks at it. At the heart of it, and with reference to the points you raised about HDFS, one needs to have a working knowledge of the Hadoop core system, including HDFS, the MapReduce algorithm and

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
s/paying a role/playing a role/ On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra wrote: > One way you can start to make this make more sense, Sean, is if you > exploit the code/data duality so that the non-distributed data that you are > sending out from the driver is

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
One way you can start to make this make more sense, Sean, is if you exploit the code/data duality so that the non-distributed data that you are sending out from the driver is actually paying a role more like code (or at least parameters.) What is sent from the driver to an Executor is then used

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Sean Owen
Without a distributed storage system, your application can only create data on the driver and send it out to the workers, and collect data back from the workers. You can't read or write data in a distributed way. There are use cases for this, but pretty limited (unless you're running on 1
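
The driver-only pattern described above can be sketched in PySpark as follows. This is a minimal illustration, assuming `sc` is an already-created `pyspark.SparkContext`; the function name `square_on_workers` is hypothetical. The data starts as a local list on the driver, is shipped to the executors with `parallelize`, and the results come back to the driver with `collect`, so no shared storage is read or written.

```python
def square_on_workers(sc, numbers):
    """Send driver-side data to the executors, transform it, and collect the results.

    `sc` is assumed to be a pyspark SparkContext created by the caller, e.g.
    SparkContext("local[2]", "driver-only-example").
    """
    rdd = sc.parallelize(numbers)      # driver -> workers
    squared = rdd.map(lambda x: x * x)  # runs on the workers
    return squared.collect()            # workers -> driver
```

Everything here has to fit in the driver's memory on the way out and on the way back, which is why this approach only suits small inputs and outputs.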

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
@Mich I understand why I would need ZooKeeper. It is there for fault tolerance, given that Spark has a master-slave architecture: when a master goes down, ZooKeeper will run a leader election algorithm to elect a new leader. However, DevOps hate ZooKeeper; they would be much happier to go with etcd &
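
For context on the master failover discussed above: in Spark's standalone mode, master recovery is configured through `spark.deploy.*` system properties passed in `SPARK_DAEMON_JAVA_OPTS`. The sketch below only renders that option string; the helper name `daemon_java_opts`, the ZooKeeper host names, and the recovery directory are hypothetical, and the property names should be checked against the standalone-mode documentation for the Spark version in use. It shows the ZooKeeper-backed mode next to the single-master file-system recovery mode, which needs no ZooKeeper but also provides no standby master.

```python
def daemon_java_opts(props):
    """Render a property mapping as -Dkey=value flags for SPARK_DAEMON_JAVA_OPTS."""
    return " ".join(f"-D{key}={value}" for key, value in props.items())


# Standby masters with leader election through ZooKeeper.
zookeeper_recovery = {
    "spark.deploy.recoveryMode": "ZOOKEEPER",
    "spark.deploy.zookeeper.url": "zk1:2181,zk2:2181,zk3:2181",  # hypothetical hosts
    "spark.deploy.zookeeper.dir": "/spark",
}

# Single master that restores its state from a directory after a restart.
filesystem_recovery = {
    "spark.deploy.recoveryMode": "FILESYSTEM",
    "spark.deploy.recoveryDirectory": "/var/spark/recovery",  # hypothetical path
}

# Lines of this form would go into conf/spark-env.sh on the master nodes.
print(f'SPARK_DAEMON_JAVA_OPTS="{daemon_java_opts(zookeeper_recovery)}"')
print(f'SPARK_DAEMON_JAVA_OPTS="{daemon_java_opts(filesystem_recovery)}"')
```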

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Peter Figliozzi
Spark is a parallel computing framework. There are many ways to give it data to chomp down on. If you don't know why you would need HDFS, then you don't need it. Same goes for Zookeeper. Spark works fine without either. Much of what we read online comes from people with specialized problems

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
You can use Spark on Oracle as a query tool. It all depends on the mode of operation. If you are running Spark with yarn-client/cluster then you will need YARN. It comes as part of Hadoop core (HDFS, MapReduce and YARN). I have not gone and installed YARN without installing Hadoop. What is

What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-24 Thread kant kodali
What do I lose if I run Spark without using HDFS or ZooKeeper? Which of them is almost a must in practice?