Hi Rob,
I fear your questions will be hard to answer without additional information
about what kind of simulations you plan to do. int[r][c] basically means
you have a matrix of integers? You could, for example, map this to a
row-oriented RDD of integer arrays or to a column-oriented RDD of integer
arrays. Which option is better will depend heavily on your workload.
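As a rough sketch of the row-oriented variant (assuming an existing
JavaSparkContext named sc; the dimensions are purely illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    // One RDD element per matrix row; each element is an int[c].
    int r = 1000, c = 20000;                   // illustrative sizes only
    List<int[]> rows = new ArrayList<int[]>(r);
    for (int i = 0; i < r; i++) {
        rows.add(new int[c]);                  // fill with your real data
    }
    JavaRDD<int[]> matrix = sc.parallelize(rows);

The column-oriented variant is the same idea with the roles of r and c
swapped.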
Also have a look at the algebraic data structures that come with MLlib (
https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors
).
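For instance (note that these vectors store doubles, so your int rows
would need widening):

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    // A dense mllib vector built from one (widened) matrix row.
    double[] row = {1, 2, 3};
    Vector v = Vectors.dense(row);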
Regards,
Jeff
2015-02-25 23:58 GMT+01:00 Rob Sargent rob.sarg...@utah.edu:
I have an application which might benefit from Spark's
distribution/analysis, but I'm worried about the size and structure of my
data set. I need to perform several thousand simulations on a rather large
data set, and I need access to all the generated simulations. The data
element is largely an int[r][c] where r is 100 to 1000 and c is 20-80K
(there's more, but that array is the bulk of the problem). I have machines
and memory capable of running 6-10 simulations simultaneously in separate
JVMs. Is this data structure compatible with Spark's RDD notion?
If yes, I will have a slew of how-to-get-started questions, the first of
which is: how do I seed the run? My thinking is to use
org.apache.spark.api.java.FlatMapFunction, starting with an EmptyRDD and
the seed data. Would that be the way to go? Roughly the sketch below is
what I have in mind.
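(Here runSimulation stands in for my real model and sc is my
JavaSparkContext; I parallelize the seeds rather than starting from an
EmptyRDD, since an empty RDD would give flatMap nothing to expand:)

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    // Each seed expands into the rows produced by one simulation run.
    JavaRDD<Long> seeds = sc.parallelize(Arrays.asList(1L, 2L, 3L));
    JavaRDD<int[]> results = seeds.flatMap(new FlatMapFunction<Long, int[]>() {
        public Iterable<int[]> call(Long seed) {
            return runSimulation(seed);  // stand-in for my real model
        }
    });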
Thanks