I understand now that I cannot use Spark Streaming window operations without
checkpointing to HDFS, as pointed out by @Ofir, but without window operations I
don't think we can do much with Spark Streaming. Since it is essential,
can I use Cassandra as the distributed storage? If so, can I see
On 26 Aug 2016, at 12:58, kant kodali
> wrote:
@Steve your arguments make sense; however, a good majority of people who
have extensive experience with ZooKeeper prefer to avoid it, and given
the ease of Consul (which btw uses Raft for
@Mich of course, and in my previous message I gave some context as well.
Needless to say, the tools used by many banks that I came across, such
as Citi, Capital One, Wells Fargo, GSachs, are pretty laughable when it comes to
compliance and security. They somehow think they are secure when
We use Spark with NFS as the data store, mainly using Dr. Jeremy Freeman’s Thunder framework. Works very well (and I see HUGE throughput on the storage system during loads). I haven’t seen (or heard from the devs/users) a need for HDFS or S3.
—Ken
On Aug 25, 2016, at 8:02 PM,
And yes, any technology needs time to mature, but that said, it shouldn't
stop us from transitioning
It depends on the application and how mission-critical the business it is
deployed for is. If you are using a tool for a bank's Credit Risk
(Surveillance, Anti-Money Laundering, Employee
@Steve your arguments make sense; however, a good majority of people who
have extensive experience with ZooKeeper prefer to avoid it, and given the
ease of Consul (which btw uses Raft for the election) and etcd, a lot of us are
more inclined to avoid ZK.
And yes, any technology needs
On 25 Aug 2016, at 22:49, kant kodali
> wrote:
Yeah, so it seems like it's a work in progress. At the very least, Mesos took the
initiative to provide alternatives to ZK. I am really looking forward to
this.
The ZFS Linux port has become very stable these days, given that LLNL maintains
it and also uses it as the file system for their supercomputer (which is
one of the top in the nation, from what I heard).
On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote:
How about
How about using ZFS?
On Thu, Aug 25, 2016 3:48 PM, Mark Hamstra m...@clearstorydata.com wrote:
That's often not as important as you might think. It really only affects the
loading of data by the first Stage. Subsequent Stages (in the same Job or even
in other Jobs if you do it right) will
That's often not as important as you might think. It really only affects
the loading of data by the first Stage. Subsequent Stages (in the same Job
or even in other Jobs if you do it right) will use the map outputs, and
will do so with good data locality.
On Thu, Aug 25, 2016 at 3:36 PM, ayan
> You would lose the ability to process data closest to where it resides if
> you do not use HDFS.
This isn't true. Many other data sources (e.g. Cassandra) support locality.
On Thu, Aug 25, 2016 at 3:36 PM, ayan guha wrote:
> At the core of it map reduce relies heavily on
At its core, MapReduce relies heavily on data locality. You would
lose the ability to process data closest to where it resides if you do not
use HDFS.
S3 or NFS will not be able to provide that.
On 26 Aug 2016 07:49, "kant kodali" wrote:
> Yeah, so it seems like it's a work
Yeah, so it seems like it's a work in progress. At the very least, Mesos took the
initiative to provide alternatives to ZK. I am really looking forward to
this.
https://issues.apache.org/jira/browse/MESOS-3797
On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:
Mesos
Mesos also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress:
https://issues.apache.org/jira/browse/MESOS-1806
On Thu, Aug 25, 2016 at 1:55 PM, kant kodali wrote:
> @Ofir @Sean very good points.
>
> @Mike We don't use Kafka
@Ofir @Sean very good points.
@Mike We don't use Kafka or Hive, and I understand that ZooKeeper can do many
things, but for our use case all we need it for is high availability, and given
the frustrations of the DevOps people here in our company, who had extensive
experience managing large clusters in the past
Just to add one concrete example regarding the HDFS dependency.
Have a look at checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, in Spark Streaming you cannot do any window operation in a
cluster without checkpointing to HDFS (or S3).
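
To illustrate why a windowed computation needs durable storage, here is a minimal Python sketch. It is not Spark's API: the class name `WindowedCounter`, the JSON checkpoint format, and the local temp directory standing in for HDFS or S3 are all assumptions made for illustration. The point is only that the window spans several micro-batches, so its state has to survive a restart.

```python
import json
import os
import tempfile
from collections import deque


class WindowedCounter:
    """Toy sliding-window sum over micro-batches (hypothetical, not Spark's API)."""

    def __init__(self, window_batches, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        # Only the most recent `window_batches` batch totals are kept.
        self.batches = deque(maxlen=window_batches)
        # Recover the window state if a previous run left a checkpoint behind.
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                self.batches.extend(json.load(f))

    def add_batch(self, values):
        """Add one micro-batch, persist the window, and return the window total."""
        self.batches.append(sum(values))
        # Write to a temp file first, then rename, so a crash mid-write
        # never leaves a half-written checkpoint.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(list(self.batches), f)
        os.replace(tmp_path, self.checkpoint_path)
        return sum(self.batches)


def demo():
    # Assumption: a local temp directory stands in for HDFS/S3.
    checkpoint = os.path.join(tempfile.mkdtemp(), "window.json")
    counter = WindowedCounter(window_batches=3, checkpoint_path=checkpoint)
    counter.add_batch([1, 2])  # window total is 3
    counter.add_batch([3])     # window total is 6
    # Simulate a crash: a new instance reloads the window from the checkpoint.
    restarted = WindowedCounter(window_batches=3, checkpoint_path=checkpoint)
    return restarted.add_batch([4])  # window total is 3 + 3 + 4 = 10
```

In a real cluster the restarted process may land on a different node, so the checkpoint location has to be reachable from every node, which is where HDFS or S3 comes in.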
Hi Kant,
I trust the following would be of use.
Big Data depends on the Hadoop ecosystem from whichever angle one looks at it.
At the heart of it, and with reference to the points you raised about HDFS, one
needs to have a working knowledge of the Hadoop core system, including HDFS,
the MapReduce algorithm and
s/playing a role/paying a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra
wrote:
> One way you can start to make this make more sense, Sean, is if you
> exploit the code/data duality so that the non-distributed data that you are
> sending out from the driver is
One way you can start to make this make more sense, Sean, is if you exploit
the code/data duality so that the non-distributed data that you are sending
out from the driver is actually paying a role more like code (or at least
parameters.) What is sent from the driver to an Executor is then used
Without a distributed storage system, your application can only create data
on the driver and send it out to the workers, and collect data back from
the workers. You can't read or write data in a distributed way. There are
use cases for this, but pretty limited (unless you're running on 1
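
The driver-only pattern described above can be sketched in plain Python. This is a toy analogy under stated assumptions, not Spark code: threads stand in for workers, and the helper names `split` and `run_on_workers` are hypothetical. All data originates on the "driver", is sent out in partitions, and the results are collected back.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_partitions):
    """Split a driver-side list into roughly equal partitions (strided)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_on_workers(data, func, num_workers=4):
    """Send partitions to workers, apply func to each element, collect results.

    Threads are a stand-in for remote workers (an assumption for illustration).
    Element order is not preserved because of the strided split.
    """
    partitions = split(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(lambda part: [func(x) for x in part], partitions)
        # Gather every partition's output back on the "driver".
        return [y for part in results for y in part]
```

For example, `sorted(run_on_workers(list(range(10)), lambda x: x * x))` gives the squares of 0 through 9. Everything passes through the driver, so the driver's memory and network link bound the job size, which is the limitation being pointed out.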
@Mich I understand why I would need ZooKeeper. It is there for fault tolerance,
given that Spark is a master-slave architecture: when a master goes down,
ZooKeeper will run a leader election algorithm to elect a new leader. However,
DevOps hate ZooKeeper; they would be much happier to go with etcd &
Spark is a parallel computing framework. There are many ways to give it
data to chomp down on. If you don't know why you would need HDFS, then you
don't need it. Same goes for Zookeeper. Spark works fine without either.
Much of what we read online comes from people with specialized problems
You can use Spark on Oracle as a query tool.
It all depends on the mode of the operation.
If you are running Spark in yarn-client/cluster mode, then you will need YARN. It
comes as part of Hadoop core (HDFS, MapReduce and YARN).
I have not gone and installed YARN without installing Hadoop.
What is
What do I lose if I run Spark without using HDFS or ZooKeeper? Which of them is
almost a must in practice?