I have an application which might benefit from Spark's
distribution/analysis capabilities, but I'm worried about the size and
structure of my data set. I need to perform several thousand simulations
on a rather large data set, and I need access to all of the generated
simulations. The main data element is an int[r][c] where r is 100 to 1000
and c is 20-80K (there's more, but that array is the bulk of the problem).
I have machines and memory capable of running 6-10 simulations
simultaneously in separate JVMs. Is this data structure compatible with
Spark's RDD notion?
If yes, I will have a slew of how-to-get-started questions, the first
of which is how to seed the run. My thinking is to use
org.apache.spark.api.java.FlatMapFunction, starting with an EmptyRDD and
the seed data. Would that be the way to go?
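
Here's a rough sketch of the sort of thing I have in mind, assuming the
Spark 1.x Java API. I wasn't sure how the seed data would attach to an
EmptyRDD, so this version parallelizes a seed list instead and expands
each seed with flatMap; runSimulation() is a stand-in for my own code:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;

    public class SeedSketch {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("simulations"));

            // One entry per simulation run to be seeded.
            List<Long> seeds = new ArrayList<Long>();
            for (long s = 0; s < 5000; s++) {
                seeds.add(s);
            }

            // Distribute the seeds, then expand each seed into the
            // int[r][c] result(s) its run generates.
            JavaRDD<int[][]> results = sc.parallelize(seeds).flatMap(
                new FlatMapFunction<Long, int[][]>() {
                    public Iterable<int[][]> call(Long seed) {
                        return runSimulation(seed);
                    }
                });

            results.cache();  // keep all generated simulations accessible
            System.out.println("simulations: " + results.count());
            sc.stop();
        }

        // Placeholder: would produce the matrices for one seeded run.
        static List<int[][]> runSimulation(long seed) {
            List<int[][]> out = new ArrayList<int[][]>();
            out.add(new int[100][20000]);
            return out;
        }
    }

Is that roughly the right shape, or is the EmptyRDD approach preferable?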
Thanks
Rob Sargent