I am architecting a platform that incorporates recommender systems, information retrieval (ML), sequence mining, and Natural Language Processing.
Additionally, I have the generic CRUD and authentication components, with everything exposed RESTfully.

For the storage layer(s), there are a few options which immediately present themselves:

Generic CRUD layer (high speed needed here, though I suppose I could use Redis…)
- Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema SQL layer atop
- Apache Spark (perhaps piping to HDFS)… maybe?
- MongoDB (or a similar document store), a graph database, or even something like Postgres

Analytics layer (to enable Big Data / data-intensive computing features)
- Apache Spark
- Hadoop with MapReduce and/or utilising some other Apache / non-Apache project with integration
- Disco (from Nokia)

________________________________

Should I prefer one layer (e.g. on HDFS) over multiple disparate layers?
- The advantage here is obvious, but I am certain there are disadvantages. (And yes, I know there are various ways, automated and manual, to push data from non-HDFS-backed stores to HDFS.)

Also, as a bonus answer, which stack would you recommend for this user-network I'm building?