Hi Michael,

Spark itself is an execution engine, not a storage system. While it has facilities for caching data in memory, think of them the way you would think of a process on a single machine using memory: the source data still needs to be stored somewhere, and you need to be able to get back to it quickly if there's a failure.
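To make the single-machine analogy concrete, here is a minimal sketch in plain Python (not Spark code). The class `CachedDataset`, its `simulate_failure` method, and the `demo` function are hypothetical names made up for illustration, and a temporary file stands in for a durable store such as HDFS. The point is only that the in-memory copy is disposable, while the stored copy is what you fall back on after a failure.

```python
import json
import os
import tempfile


class CachedDataset:
    """In-memory cache over a durable source file (hypothetical helper, not a Spark API)."""

    def __init__(self, path):
        self.path = path      # durable copy of the data (stands in for HDFS, Cassandra, ...)
        self._cache = None    # in-memory copy, lost if the process dies
        self.loads = 0        # how many times we had to go back to storage

    def records(self):
        # Serve from memory when possible; otherwise re-read the durable source.
        if self._cache is None:
            with open(self.path) as f:
                self._cache = [json.loads(line) for line in f]
            self.loads += 1
        return self._cache

    def simulate_failure(self):
        # Dropping the cache models losing the memory that held the data.
        self._cache = None


def demo():
    # Write a small source file, then read it through the cache.
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    with os.fdopen(fd, "w") as f:
        for i in range(3):
            f.write(json.dumps({"id": i, "amount": i * 10}) + "\n")
    try:
        ds = CachedDataset(path)
        total_before = sum(r["amount"] for r in ds.records())
        ds.records()              # second read is served from memory
        ds.simulate_failure()
        total_after = sum(r["amount"] for r in ds.records())  # reloaded from storage
        return total_before, total_after, ds.loads
    finally:
        os.remove(path)
```

Calling `demo()` gives the same total before and after the simulated failure, with two loads from storage in total. The cache only saves repeated reads; it never replaces the stored copy.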
To echo what Sonal said, it depends on the needs of your application. If you expect to mostly write jobs that read and write data in batch, storing data on HDFS in a binary format like Avro or Parquet will give you the best performance. If other systems need random access to your data, you'd want to consider a system like HBase or Cassandra, though these are likely to perform somewhat worse and to incur higher operational overhead.

-Sandy

On Tue, Jun 23, 2015 at 11:21 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> When you deploy Spark over Hadoop, you typically want to leverage the
> replication of HDFS, or your data is already in Hadoop. Again, if your data
> is already in Cassandra, or if you want to do updateable atomic row
> operations and access your data as well as run analytic jobs, that may
> be another case.
>
> On Jun 24, 2015 1:17 AM, "commtech" <michael.leon...@opco.com> wrote:
>
>> Hi,
>>
>> I work at a large financial institution in New York. We're looking into
>> Spark and trying to learn more about the deployment/use cases for
>> real-time analytics with Spark. When would it be better to deploy
>> standalone Spark versus Spark on top of a more comprehensive data
>> management layer (Hadoop, Cassandra, MongoDB, etc.)? If you do deploy on
>> top of one of these, are there different use cases where one of these
>> database management layers is better or worse?
>>
>> Any color would be very helpful. Thank you in advance.
>>
>> Sincerely,
>> Michael
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/When-to-use-underlying-data-management-layer-versus-standalone-Spark-tp23455.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.