I have an application which might benefit from Spark's
distribution/analysis capabilities, but I'm worried about the size and
structure of my data set. I need to perform several thousand simulations
on a rather large data set, and I need access to all of the generated
simulations. The main data element is an int[r][c] where r is 100 to 1000
and c is 20-80K (there's more, but that array is the bulk of the problem).
I have machines and memory capable of running 6-10 simulations
simultaneously in separate JVMs. Is this data structure compatible with
Spark's RDD notion?
If yes, I will have a slew of how-to-get-started questions, the first
of which is how to seed the run. My thinking is to use
org.apache.spark.api.java.FlatMapFunction, starting with an EmptyRDD and
the seed data. Would that be the way to go?
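
Here's a rough sketch of the sort of thing I have in mind, assuming the
Spark 1.x Java API. I wasn't sure how the seed data would attach to an
EmptyRDD, so this version parallelizes a seed list instead and expands
each seed with flatMap; runSimulation() is a stand-in for my own code:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;

    public class SeedSketch {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("simulations"));

            // One entry per simulation run to be seeded.
            List<Long> seeds = new ArrayList<Long>();
            for (long s = 0; s < 5000; s++) {
                seeds.add(s);
            }

            // Distribute the seeds, then expand each seed into the
            // int[r][c] result(s) its run generates.
            JavaRDD<int[][]> results = sc.parallelize(seeds).flatMap(
                new FlatMapFunction<Long, int[][]>() {
                    public Iterable<int[][]> call(Long seed) {
                        return runSimulation(seed);
                    }
                });

            results.cache();  // keep all generated simulations accessible
            System.out.println("simulations: " + results.count());
            sc.stop();
        }

        // Placeholder: would produce the matrices for one seeded run.
        static List<int[][]> runSimulation(long seed) {
            List<int[][]> out = new ArrayList<int[][]>();
            out.add(new int[100][20000]);
            return out;
        }
    }

Is that roughly the right shape, or is the EmptyRDD approach preferable?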
Thanks
Rob Sargent