Hi,

I'm running a computation on top of a large, dynamic model that receives constant online updates. Batch (stateless) mode would require shipping the heavy model to Spark on every run, so streaming seemed more appropriate. I did get the computation working in stream mode, with the model living in Spark and receiving the live updates it needs.

However, my main motivation for using Spark was complexity reduction / compute speedup via a distributed algorithm, and state management is clearly a challenge in terms of memory resources; I don't want it to overwhelm or disrupt the computation.

My main question: as effective as Spark may be computation-wise (given a suitable distributed algorithm), does the user have any degree of control over its memory footprint? For example, is splitting the state's memory across the workers feasible, or does all memory eventually end up centralized on the Spark master (or on the driver, depending on deploy mode: client or cluster)?

I'm basically looking for a way to scale out memory-wise, not just compute-wise. Would converting a centralized data structure into RDDs (which I already use for compute) relieve/distribute the memory footprint as well? For example, can I split my main in-memory data structure into a set of partitions assigned to each worker?

Thanks a lot,
David
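To make the last question concrete, here is a minimal, Spark-free sketch of the kind of split I have in mind: hash-partitioning a centralized Python dict into per-worker shards, roughly the way I understand RDD partitioning would spread records across executors. All names here (`shard_by_key`, `num_workers`, `model_state`) are just illustrative, not from any real API.

```python
# Conceptual sketch (plain Python, no Spark required): hash-partition a
# centralized dict into per-worker shards, mimicking how I understand
# RDD partitioning distributes records across executors.

def shard_by_key(data, num_workers):
    """Split a dict into num_workers shards by hashing each key."""
    shards = [{} for _ in range(num_workers)]
    for key, value in data.items():
        # Each key is routed to exactly one shard (i.e. one worker).
        shards[hash(key) % num_workers][key] = value
    return shards

# Illustrative "model state": the big central structure to be split.
model_state = {f"feature_{i}": i * 0.5 for i in range(10)}
shards = shard_by_key(model_state, num_workers=3)

# Each shard holds only a subset; together they cover the full state.
assert sum(len(s) for s in shards) == len(model_state)
```

In actual Spark, I believe the equivalent would be something like `sc.parallelize(items, numSlices)` to distribute the collection, or `pairRdd.partitionBy(numPartitions)` for key-based placement, with each partition then held in its executor's memory rather than on the driver; please correct me if that mental model is wrong.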
-- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/