Hi Gautham,

I'm a beginner Spark user too and may not have a complete understanding of your question, but I thought I would start a discussion anyway.

Have you looked into using Spark's built-in Correlation function (https://spark.apache.org/docs/latest/ml-statistics.html)? It might let you get what you want (per-row correlation against the same matrix) without having to parallelize the computation yourself.

Also, I think how quickly you can get your results is largely a data access question rather than a "how fast is Spark" question. As long as you can exploit data parallelism (i.e., you can partition your data), Spark will give you a speedup.

You can imagine that if you had a large machine with many cores and ~100 GB of RAM (e.g. an m5.12xlarge EC2 instance), you could fit your problem in main memory and perform the computation with thread-based parallelism, which might get you results relatively quickly. For a dedicated application with well-constrained memory and compute requirements, doing everything on one machine isn't a bad option either. Accessing an external database and distributing work over a large number of machines can add overhead that may be out of your control.
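In case it helps, here is a very rough, untested sketch of calling that built-in Correlation API from PySpark. The app name, column name, and toy data below are just made-up placeholders, and note that it computes a full Pearson correlation matrix over a vector column, so you'd still need to check whether that layout matches your per-row use case:

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.stat import Correlation

    spark = SparkSession.builder.appName("corr-sketch").getOrCreate()

    # Toy data: one dense vector per record.
    df = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0, 3.0]),),
         (Vectors.dense([4.0, 5.0, 6.0]),),
         (Vectors.dense([7.0, 8.0, 10.0]),)],
        ["features"],
    )

    # Returns a one-row DataFrame whose single cell is the correlation Matrix.
    corr_matrix = Correlation.corr(df, "features", "pearson").head()[0]
    print(corr_matrix)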
Thanks,
Steven

On Thu, Jul 11, 2019 at 9:24 AM Gautham Acharya <gauth...@alleninstitute.org> wrote:

> Ping? I would really appreciate advice on this! Thank you!
>
> From: Gautham Acharya
> Sent: Tuesday, July 9, 2019 4:22 PM
> To: user@spark.apache.org
> Subject: [Beginner] Run compute on large matrices and return the result in seconds?
>
> This is my first email to this mailing list, so I apologize if I made any errors.
>
> My team is going to be building an application, and I'm investigating some options for distributed compute systems. We want to perform computations on large matrices.
>
> The requirements are as follows:
>
> 1. The matrices can be expected to be up to 50,000 columns x 3 million rows. The values are all integers (except for the row/column headers).
>
> 2. The application needs to select a specific row and calculate the correlation coefficient (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) against every other row. This means up to 3 million separate calculations.
>
> 3. A sorted list of the correlation coefficients and their corresponding row keys needs to be returned in under 5 seconds.
>
> 4. Users will eventually request random row/column subsets to run calculations on, so precomputing our coefficients is not an option. This needs to be done on request.
>
> I've been looking at many compute solutions, but I'd consider Spark first due to its widespread use and community. I currently have my data loaded into Apache HBase for a different scenario (random access of rows/columns). I've naively tried loading a dataframe from the CSV using a Spark instance hosted on AWS EMR, but getting the results for even a single correlation takes over 20 seconds.
>
> Thank you!
>
> --gautham
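P.S. For the specific requirement you describe (one selected row correlated against every other row, returned as a sorted list), a very rough, untested sketch of one way to phrase it in PySpark with a broadcast reference row might look like the following. The row keys, values, and sizes here are toy placeholders, and I have no idea whether this approach would meet your 5-second target at 3 million rows:

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("row-corr-sketch").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical stand-in for the real matrix: (row_key, array of values).
    rows = sc.parallelize([
        ("row_a", np.array([1.0, 2.0, 3.0, 4.0])),
        ("row_b", np.array([2.0, 4.0, 6.0, 8.0])),
        ("row_c", np.array([4.0, 3.0, 2.0, 1.0])),
    ])

    # The "selected" row, broadcast once to every executor.
    target = sc.broadcast(np.array([1.0, 2.0, 3.0, 5.0]))

    def pearson(values):
        # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
        return float(np.corrcoef(values, target.value)[0, 1])

    # Correlate every row against the broadcast row, then sort by coefficient.
    result = (rows.mapValues(pearson)
                  .sortBy(lambda kv: kv[1], ascending=False)
                  .collect())
    print(result)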