At the end of the day, the more data a query pulls from multiple physical
nodes, the (relatively) slower your query response time. Until that
response time exceeds your business requirements, keep it simple. As
volumes grow and distributed data sources feed the queries, you will need
to start considering a relational or pseudo-relational architecture "on
top of" Hadoop.

The driving questions tend to be:

- Does the mix of queries access the entire range of base data across the
cluster?
- How much latency is permissible between receipt of (raw) new data,
processing of that data into the SQL/NotOnlySQL repository, and delivery
of the full mix of results to your spectrum of users?
- Can you move some mix of queries to a somewhat out-of-date repository
that is refreshed, e.g., daily?
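One way to make the latency question above concrete is as a simple budget:
sum the end-to-end stages (ingest of raw data, processing into the
repository, delivery of results) and compare the total against the
business requirement. A minimal sketch of that arithmetic — all stage
names and numbers here are illustrative assumptions, not measurements from
any particular cluster:

```python
# Hypothetical latency-budget check: does the end-to-end pipeline
# (ingest -> process into repository -> deliver results) fit within
# the business requirement (SLA)? Stage names/values are assumptions.

def within_latency_budget(stage_latencies_sec, sla_sec):
    """Return (total, ok): the summed pipeline latency and whether
    it fits inside the permissible SLA."""
    total = sum(stage_latencies_sec.values())
    return total, total <= sla_sec

# Illustrative stages for a 5-minute (300 s) requirement.
stages = {
    "ingest_raw": 30.0,        # receipt of (raw) new data
    "process_to_repo": 240.0,  # load into the SQL/NotOnlySQL repository
    "query_delivery": 5.0,     # serving results to users
}

total, ok = within_latency_budget(stages, sla_sec=300.0)
```

When the check fails, that is exactly the point where a somewhat
out-of-date repository refreshed daily, or an extra serving layer,
starts to earn its complexity.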

In the end, the complexity of your business data requirements and of the
processing itself is what drives you toward more layers and more complex
solutions.










*"Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming 'Wow! What a Ride!'" - Hunter Thompson*

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Tue, Jan 20, 2015 at 9:12 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. Is Apache Spark good as a general database
>
> I don't think Spark itself is a general database though there're
> connectors to various NoSQL databases, including HBase.
>
> bq. using their graph database features?
>
> Sure. Take a look at http://spark.apache.org/graphx/
>
> Cheers
>
> On Tue, Jan 20, 2015 at 9:02 PM, Alec Taylor <alec.tayl...@gmail.com>
> wrote:
>
>> Small amounts in a one node cluster (at first).
>>
>> As it scales I'll be looking at running various O(nk) algorithms,
>> where n is the number of distinct users and k are the overlapping
>> features I want to consider.
>>
>> Is Apache Spark good as a general database as well as its fancier
>> features? - E.g.: considering I'm building a network, maybe using
>> their graph database features?
>>
>> On Wed, Jan 21, 2015 at 2:27 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > Apache Spark supports integration with HBase (which has REST API).
>> >
>> > What's the amount of data you want to store in this system ?
>> >
>> > Cheers
>> >
>> > On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor <alec.tayl...@gmail.com>
>> wrote:
>> >>
>> >> I am architecting a platform incorporating: recommender systems,
>> >> information retrieval (ML), sequence mining, and Natural Language
>> >> Processing.
>> >>
>> >> Additionally I have the generic CRUD and authentication components,
>> >> with everything exposed RESTfully.
>> >>
>> >> For the storage layer(s), there are a few options which immediately
>> >> present themselves:
>> >>
>> >> Generic CRUD layer (high speed needed here, though I suppose I could
>> use
>> >> Redis…)
>> >>
>> >> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
>> >> SQL layer atop
>> >> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
>> >> - MongoDB (or a similar document-store), a graph-database, or even
>> >> something like Postgres
>> >>
>> >> Analytics layer (to enable Big Data / Data-intensive computing
>> features)
>> >>
>> >> - Apache Spark
>> >> - Hadoop with MapReduce and/or utilising some other Apache /
>> >> non-Apache project with integration
>> >> - Disco (from Nokia)
>> >>
>> >> ________________________________
>> >>
>> >> Should I prefer one layer—e.g.: on HDFS—over multiple disparate
>> >> layers? - The advantage here is obvious, but I am certain there are
>> >> disadvantages. (and yes, I know there are various ways; automated and
>> >> manual; to push data from non HDFS-backed stores to HDFS)
>> >>
>> >> Also, as a bonus answer, which stack would you recommend for this
>> >> user-network I'm building?
>> >
>> >
>>
>
>
