Re: Considering Spark for large data elements

2015-02-26 Thread Jeffrey Jedele
Hi Rob,
I fear your questions will be hard to answer without additional information
about what kind of simulations you plan to do. int[r][c] basically means
you have a matrix of integers? You could, for example, map this to a
row-oriented RDD of integer arrays or to a column-oriented RDD of integer
arrays. Which option is better will depend heavily on your workload.
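For the row-oriented layout, a minimal sketch with the Java API could look
like this (local mode only; the matrix is just a stand-in for one
simulation's data, and the row-sum workload is purely illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RowRddSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("row-rdd-sketch").setMaster("local[*]"));

    // Stand-in for one simulation's int[r][c] element.
    int[][] matrix = new int[100][20000];

    // Row-oriented: each RDD element is one row of the matrix.
    JavaRDD<int[]> rows = sc.parallelize(Arrays.asList(matrix));

    // Example workload: sum each row in parallel.
    JavaRDD<Long> rowSums =
        rows.map(row -> Arrays.stream(row).asLongStream().sum());
    System.out.println(rowSums.take(5));

    sc.stop();
  }
}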
Also have a look at the algebraic data structures that come with MLlib (
https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors
).
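For instance, wrapping one matrix row as a dense MLlib vector (note that
MLlib vectors store doubles, so your ints would be widened):

import java.util.Arrays;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class VectorSketch {
  public static void main(String[] args) {
    int[] row = {1, 2, 3, 4};
    // MLlib vectors are double-based, so widen the int row first.
    double[] asDoubles = Arrays.stream(row).asDoubleStream().toArray();
    Vector v = Vectors.dense(asDoubles);
    System.out.println(v); // [1.0,2.0,3.0,4.0]
  }
}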

Regards,
Jeff

2015-02-25 23:58 GMT+01:00 Rob Sargent rob.sarg...@utah.edu:

  I have an application which might benefit from Spark's
 distribution/analysis, but I'm worried about the size and structure of my
 data set.  I need to perform several thousand simulations on a rather
 large data set, and I need access to all the generated simulations.  The
 data element is largely an int[r][c] where r is 100 to 1000 and c is
 20-80K (there's more, but that array is the bulk of the problem).  I have
 machines and memory capable of running 6-10 simulations simultaneously in
 separate JVMs.  Is this data structure compatible with Spark's RDD notion?

 If yes, I will have a slew of how-to-get-started questions, the first of
 which is: how do I seed the run?  My thinking is to use
 org.apache.spark.api.java.FlatMapFunction, starting with an EmptyRDD and
 the seed data.  Would that be the way to go?

 Thanks



Considering Spark for large data elements

2015-02-25 Thread Rob Sargent
I have an application which might benefit from Spark's
distribution/analysis, but I'm worried about the size and structure of
my data set.  I need to perform several thousand simulations on a rather
large data set, and I need access to all the generated simulations.  The
data element is largely an int[r][c] where r is 100 to 1000 and c is
20-80K (there's more, but that array is the bulk of the problem).  I have
machines and memory capable of running 6-10 simulations simultaneously in
separate JVMs.  Is this data structure compatible with Spark's RDD notion?


If yes, I will have a slew of how-to-get-started questions, the first
of which is: how do I seed the run?  My thinking is to use
org.apache.spark.api.java.FlatMapFunction, starting with an EmptyRDD and
the seed data (roughly the sketch below).  Would that be the way to go?
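
Something like this, perhaps (a rough sketch only; runSimulation and the
seed values are placeholders, and I'm assuming the Spark 1.x Java API,
where FlatMapFunction returns an Iterable):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SeedSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("seed-sketch").setMaster("local[*]"));

    // Instead of unioning onto an EmptyRDD, parallelize the seeds
    // directly; each seed fans out into its simulation's rows.
    List<Long> seeds = Arrays.asList(1L, 2L, 3L); // placeholder seeds
    JavaRDD<int[]> simulations = sc.parallelize(seeds).flatMap(seed -> {
      List<int[]> rows = new ArrayList<>();
      // runSimulation(seed) would produce the real int[r][c] rows;
      // a single placeholder row stands in here.
      rows.add(new int[] { seed.intValue(), 0, 0 });
      return rows; // Iterable<int[]> in Spark 1.x
    });

    System.out.println(simulations.count());
    sc.stop();
  }
}

(Parallelizing the seeds would sidestep the EmptyRDD/union step entirely,
if that's acceptable.)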


Thanks