Re: distributing large matrices

2015-08-14 Thread Rob Sargent

@Koen,

If you meant to reply to my question on distributing matrices, could you 
re-send? There was no content in your post.


Thanks,

On 08/07/2015 10:02 AM, Koen Vantomme wrote:


Sent from my Sony Xperia™ smartphone



iceback wrote:

Is this the sort of problem Spark can accommodate?

I need to compare 10,000 matrices with each other (10^10 comparisons).  The
matrices are 100x10 (10^7 int values in total).
I have 10 machines with 2 to 8 cores (8-32 processors).
All machines have to
- contribute to matrix generation (a simulation, takes seconds)
- see all matrices
- compare matrices (takes very little time compared to simulation)

I expect to persist the simulations and have Spark push them to the
processors.
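
A quick sanity check on the sizes quoted above may help when judging
whether the job fits Spark. The sketch below only does the arithmetic:
the matrix count and shape come from the question, while the class and
method names are made up for illustration. Counted as whole-matrix
comparisons, 10,000 matrices give roughly 5 x 10^7 unordered pairs
(10^8 ordered pairs), so the 10^10 figure presumably counts something
finer-grained, such as element-level comparisons; that reading is a
guess.

```java
// Back-of-the-envelope sizing for the all-pairs comparison described above.
// The figures (10,000 matrices of 100 x 10 ints) come from the question;
// the class and method names are hypothetical.
class PairwiseSizing {

    /** Number of unordered pairs {i, j} with i < j among n items. */
    static long unorderedPairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Number of ordered pairs (i, j), including i == j. */
    static long orderedPairs(long n) {
        return n * n;
    }

    /** Bytes needed for n matrices of rows x cols 4-byte ints (payload only, no JVM overhead). */
    static long payloadBytes(long n, long rows, long cols) {
        return n * rows * cols * Integer.BYTES;
    }

    public static void main(String[] args) {
        long n = 10_000, rows = 100, cols = 10;
        System.out.println("unordered pairs:  " + unorderedPairs(n));
        System.out.println("ordered pairs:    " + orderedPairs(n));
        System.out.println("int values total: " + n * rows * cols);
        System.out.println("payload bytes:    " + payloadBytes(n, rows, cols));
    }
}
```

On these numbers the whole set is about 40 MB of int payload, which
suggests (untested here) that giving every machine a full copy is
feasible and that the simulation time, not data movement, would
dominate.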




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distributing-large-matrices-tp24174.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.








Considering Spark for large data elements

2015-02-25 Thread Rob Sargent
I have an application which might benefit from Spark's 
distribution/analysis, but I'm worried about the size and structure of 
my data set.  I need to perform several thousand simulations on a rather 
large data set and I need access to all the generated simulations.  The 
data element is largely an int[r][c] where r is 100 to 1000 and c is 
20-80K (there's more, but that array is the bulk of the problem).  I have 
machines and memory capable of doing 6-10 simulations simultaneously in 
separate JVMs.  Is this data structure compatible with Spark's RDD notion?
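
To put rough numbers on the size concern, the sketch below computes
the int payload of a single int[r][c] at the two ends of the stated
range. It assumes "20-80K" means 20,000 to 80,000 columns and ignores
JVM object and array headers; the class and method names are
hypothetical, and the simulation count in main is a placeholder, not
a figure from the post.

```java
// Rough memory footprint of the int[r][c] data element described above.
// Assumes c ranges over 20,000..80,000 and counts 4 bytes per int only.
class SimulationFootprint {

    /** Payload bytes of one int[rows][cols] array, ignoring JVM headers. */
    static long payloadBytes(long rows, long cols) {
        return rows * cols * Integer.BYTES;
    }

    /** Payload bytes for a number of simulations of identical shape. */
    static long totalBytes(long simulations, long rows, long cols) {
        return simulations * payloadBytes(rows, cols);
    }

    public static void main(String[] args) {
        long smallest = payloadBytes(100, 20_000);
        long largest = payloadBytes(1_000, 80_000);
        System.out.println("smallest element: " + smallest + " bytes");
        System.out.println("largest element:  " + largest + " bytes");

        // Placeholder count; the post only says "several thousand".
        long simulations = 1_000;
        System.out.println("per " + simulations + " largest simulations: "
                + totalBytes(simulations, 1_000, 80_000) + " bytes");
    }
}
```

Under these assumptions one element is between 8 MB and 320 MB, so
every thousand simulations at the upper end add about 320 GB if all
of them must stay accessible at once.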


If yes, I will have a slew of how-to-get-started questions, the first 
of which is how to seed the run?  My thinking is to use 
org.apache.spark.api.java.FlatMapFunction starting with an EmptyRDD and 
the seed data.  Would that be the way to go?
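
One way to picture the seeding question is to start from a collection
of seeds and map each seed to one simulation, instead of starting from
an empty dataset. The sketch below shows that shape with plain
java.util.stream, not Spark, so it runs without a cluster; simulate()
is a hypothetical stand-in for the real simulation and the tiny sizes
are for illustration only. In Spark's Java API the analogous shape
would presumably be JavaSparkContext.parallelize over the seed list
followed by JavaRDD.map, but that is not tested here.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// Plain-Java illustration of "one seed in, one simulation out".
// This is NOT Spark code; it only mirrors the parallelize-then-map shape.
class SeededSimulations {

    /** Hypothetical stand-in for the real simulation: fills rows x cols ints from a seed. */
    static int[][] simulate(long seed, int rows, int cols) {
        Random rng = new Random(seed); // same seed gives the same matrix
        int[][] data = new int[rows][cols];
        for (int[] row : data) {
            for (int c = 0; c < cols; c++) {
                row[c] = rng.nextInt(3);
            }
        }
        return data;
    }

    /** Runs one simulation per seed in [0, count) in parallel; results stay in seed order. */
    static List<int[][]> runAll(int count, int rows, int cols) {
        return LongStream.range(0, count)
                .parallel()
                .mapToObj(seed -> simulate(seed, rows, cols))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<int[][]> results = runAll(8, 4, 5);
        System.out.println("simulations: " + results.size());
    }
}
```

Driving each simulation from its seed keeps a run reproducible: any
single result can be regenerated from the seed alone rather than being
shipped between machines.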


Thanks