I have an application which might benefit from Spark's distribution/analysis, but I'm worried about the size and structure of my data set. I need to perform several thousand simulations on a rather large data set, and I need access to all of the generated simulations. The data element is largely an int[r][c], where r is 100 to 1000 and c is 20K to 80K (there's more, but that array is the bulk of the problem). I have machines and memory capable of running 6-10 simulations simultaneously in separate JVMs. Is this data structure compatible with Spark's RDD notion?
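For concreteness, here is a toy version of the layout I picture, where each RDD element is one int[c] row of the array. All names here are placeholders, and local mode is just for illustration:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DataShape {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("data-shape").setMaster("local[*]"));

        int r = 100;      // rows: 100 to 1000 in practice
        int c = 20_000;   // columns: 20K to 80K in practice

        // One candidate layout: each RDD element is a single int[c] row,
        // so the int[r][c] becomes a JavaRDD<int[]> of r elements.
        List<int[]> rows = new ArrayList<>(r);
        for (int i = 0; i < r; i++) {
            rows.add(new int[c]);
        }
        JavaRDD<int[]> matrix = sc.parallelize(rows);

        System.out.println("rows in RDD: " + matrix.count());
        sc.stop();
    }
}

An alternative would be one element per whole int[r][c] array, one per simulation, if elements that large are acceptable.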

If yes, I will have a slew of how-to-get-started questions, the first of which is: how do I seed the run? My thinking is to use org.apache.spark.api.java.FlatMapFunction, starting with an EmptyRDD and the seed data. Would that be the way to go? A rough sketch of what I have in mind follows.
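Concretely, something like the sketch below is what I picture (written against the Spark 2.x Java API, where FlatMapFunction#call returns an Iterator; runSimulation is a placeholder for the real work). Here I parallelize the seed data directly rather than starting from an EmptyRDD, since flatMap only expands elements that already exist:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class SimulationSeed {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("simulation-seed"));

        // Seed values that parameterize each simulation (placeholder data).
        List<Integer> seeds = Arrays.asList(1, 2, 3, 4, 5);

        // Distribute the seeds, then expand each seed into its generated
        // simulation arrays via flatMap.
        JavaRDD<int[][]> results = sc.parallelize(seeds)
            .flatMap(new FlatMapFunction<Integer, int[][]>() {
                @Override
                public Iterator<int[][]> call(Integer seed) {
                    return runSimulation(seed).iterator();
                }
            });

        results.count(); // force evaluation; persist or save as needed
        sc.stop();
    }

    // Placeholder: would run one simulation and return its int[r][c] outputs.
    private static List<int[][]> runSimulation(int seed) {
        return Arrays.asList(new int[][] {{seed}});
    }
}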

Thanks
