Re: When to use underlying data management layer versus standalone Spark?

2015-06-24 Thread Sandy Ryza
Hi Michael,

Spark itself is an execution engine, not a storage system.  While it has
facilities for caching data in memory, think about these the way you would
think about a process on a single machine leveraging memory - the source
data needs to be stored somewhere, and you need to be able to access it
quickly in case there's a failure.

To echo what Sonal said, it depends on the needs of your application.  If
you expect to mostly write jobs that read and write data in batch, storing
data on HDFS in a binary format like Avro or Parquet will give you the best
performance.  If other systems need random access to your data, you'd want
to consider a system like HBase or Cassandra, though these are likely to
cost you a little performance and incur higher operational overhead.

-Sandy

On Tue, Jun 23, 2015 at 11:21 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 When you deploy Spark over Hadoop, it's typically because you want to
 leverage the replication of HDFS, or because your data is already in
 Hadoop. Similarly, if your data is already in Cassandra, or if you want
 updateable atomic row operations and random access to your data as well as
 the ability to run analytic jobs, that may be another case.




When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread commtech
Hi,

I work at a large financial institution in New York. We're looking into
Spark and trying to learn more about the deployment/use cases for real-time
analytics with Spark. When would it be better to deploy standalone Spark
versus Spark on top of a more comprehensive data management layer (Hadoop,
Cassandra, MongoDB, etc.)? If you do deploy on top of one of these, are
there different use cases where one of these data management layers is
better or worse?

Any color would be very helpful. Thank you in advance.

Sincerely,
Michael





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/When-to-use-underlying-data-management-layer-versus-standalone-Spark-tp23455.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread canan chen
I don't think this is quite the right question.  Spark can be deployed on
different cluster manager frameworks: standalone, YARN, and Mesos.
Spark can't run without one of these cluster managers, which means Spark
depends on a cluster manager framework.

The data management layer, on the other hand, is upstream of Spark and
independent of it, although Spark does provide APIs to access different
data management layers.
Which data store to use should depend on your upstream application; it's
not really a Spark question.
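That separation shows up in the spark-submit invocation: the --master flag selects the cluster manager, while the storage layer appears only in the paths and connectors the job itself uses (hostnames and file names below are placeholders):

```shell
# Same application, different cluster managers (hosts are hypothetical)
spark-submit --master spark://master-host:7077 my_job.py   # standalone
spark-submit --master yarn-cluster my_job.py               # YARN (Spark 1.x syntax)
spark-submit --master mesos://mesos-host:5050 my_job.py    # Mesos

# The data layer is orthogonal: it appears only inside the job, e.g.
# sc.textFile("hdfs:///data/input") or a Cassandra/MongoDB connector.
```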

