Hi,

I'm pretty new to both Big Data and Spark. I've just started POC work on Spark, and my team and I are evaluating it against other in-memory computing tools such as GridGain, BigMemory, Aerospike and some others, specifically to solve two sets of problems.

1) Data storage: Our current application runs on a single, heavily configured node (24 cores, 350 GB of memory). The application loads all the datamart data, including multiple cubes, into memory, converts it, and keeps it in a Trove collection in the form of an immutable key/value map, which takes about 15-20 GB of memory. We anticipate the data will grow 10-15 fold over the next year or so, and we are not very confident that Trove can scale to that level.

2) Compute: Ours is a natively analytical application doing predictive analytics with lots of simulations and optimizations of scenarios. At the heart of all this are the Trove collections, over which we run our mathematical algorithms to calculate the end result. In doing so, the application's memory consumption goes beyond 250-300 GB, because of the many intermediate results (collections) which are further broken down to a granular level and then searched in the Trove collection. All of this happens on a single node, which obviously starts to perform slowly over time. Given the large volume of data expected in the next year or so, our current architecture will not be able to handle such a massive in-memory data set or such computing power. Hence we are targeting a move to a cluster-based, in-memory, distributed-computing architecture, and we are evaluating all of these products along with Apache Spark. We were very excited by Spark after watching the videos and some online resources, but when it came down to doing hands-on work we are facing lots of issues:

1) What are the Standalone cluster's limitations? Can I configure a cluster on a single node with multiple processes for the worker nodes, executors, etc.? Is this supported even though the IP address would be the same?

2) Why are there so many Java processes (worker nodes, executors)? Won't the communication between them slow down performance as a whole?

3) How is parallelism on partitioned data achieved? This one is really important for us to understand, since we are doing our benchmarking on partitioned data and we do not know how to configure partitions in Spark. We want to partition the data present in the cubes, so we want each cube to be a separate partition. Any help here would be appreciated.

4) What is the difference between multiple nodes executing jobs and multiple tasks executing jobs? How does each handle partitioning and parallelism?

Help with these questions would be really appreciated, to get a better sense of Apache Spark.

Thanks,
Nitin
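For context on question 1, this is the kind of single-node standalone setup we are trying to validate. A sketch of `conf/spark-env.sh`; the core/memory split is an illustrative assumption for a 24-core / 350 GB box, not a recommendation:

```shell
# conf/spark-env.sh -- illustrative values only
# Launch several worker JVMs on the same host (same IP address):
SPARK_WORKER_INSTANCES=4   # number of worker processes on this node
SPARK_WORKER_CORES=6       # cores per worker (4 workers x 6 = 24 cores total)
SPARK_WORKER_MEMORY=64g    # memory available to executors per worker
```

Each worker then spawns its own executor JVMs when an application connects, so a single machine can still run a multi-process "cluster".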
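On question 3, as I understand it Spark's default HashPartitioner assigns a record to partition `hashCode(key) mod numPartitions`. Below is a plain-Java sketch of that logic (no Spark dependency; the cube names are hypothetical), next to a scheme that pins each cube to its own partition. In real Spark the latter would mean extending `org.apache.spark.Partitioner` and passing it to `partitionBy` on a pair RDD:

```java
import java.util.Arrays;
import java.util.List;

public class CubePartitioning {

    // Mimics Spark's default hash partitioning:
    // partition = non-negative (key.hashCode() mod numPartitions)
    static int hashPartition(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // A custom scheme pinning each cube to its own partition; cubeIds is a
    // hypothetical list of this application's cube names.
    static int cubePartition(String cubeId, List<String> cubeIds) {
        int idx = cubeIds.indexOf(cubeId);
        if (idx < 0) throw new IllegalArgumentException("unknown cube: " + cubeId);
        return idx;
    }

    public static void main(String[] args) {
        List<String> cubes = Arrays.asList("sales", "inventory", "forecast");
        // Every record keyed by the same cube lands in the same partition:
        System.out.println(cubePartition("inventory", cubes)); // 1
        // Default hash partitioning spreads arbitrary keys across partitions:
        System.out.println(hashPartition("someKey", 8));
    }
}
```

Each partition is then processed by one task, so "one cube per partition" would give one task per cube, with tasks scheduled in parallel across executor cores.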
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Concepts-tp16477.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.