RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
Users can also request random rows in those columns, so a user can request a subset of the matrix (N rows and N columns), which would change the value of the correlation coefficient. From: Jerry Vinokurov [mailto:grapesmo...@gmail.com] Sent: Wednesday, July 17, 2019 1:27 PM To:
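To illustrate the point above: a Pearson coefficient computed over a subset of rows generally differs from the full-matrix value, which is why per-request row selection defeats a single precomputed answer. A tiny NumPy sketch with toy data (not the poster's actual matrices):

```python
import numpy as np

# Toy columns: positively correlated overall, but a row subset can flip the sign.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

full = np.corrcoef(x, y)[0, 1]            # correlation over all rows
subset = np.corrcoef(x[:2], y[:2])[0, 1]  # correlation over rows 0..1 only

print(full, subset)  # full is positive; the 2-row subset is exactly -1
```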

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Jerry Vinokurov
Maybe I'm not understanding something about this use case, but why is precomputation not an option? Is it because the matrices themselves change? Because if the matrices are constant, then I think precomputation would work for you even if the users request random correlations. You can just store
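Jerry's suggestion — if the matrices never change, compute every pairwise coefficient once offline and serve requests as lookups — can be sketched as follows. NumPy plus a plain dict stands in for a real key-value store; the sizes and names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.standard_normal((100, 8))  # rows x columns (toy size)

# One-time offline step: the full column-by-column Pearson correlation matrix.
corr = np.corrcoef(matrix, rowvar=False)  # shape (8, 8)

# Store each pair under a key; a real system would use Redis, DynamoDB, etc.
store = {(i, j): corr[i, j] for i in range(8) for j in range(8)}

# Online step: a user request becomes an O(1) lookup, not a recompute.
print(store[(2, 5)])
```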

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
As I said in my initial message, precomputing is not an option. Retrieving only the top/bottom N most correlated is an option – would that speed up the results? Our SLAs are soft – slight variations (±15 seconds) will not cause issues. --gautham From: Patrick McCarthy

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Patrick McCarthy
Do you really need the results of all 3MM computations, or only the top- and bottom-most correlation coefficients? Could correlations be computed on a sample and from that estimate a distribution of coefficients? Would it make sense to precompute offline and instead focus on fast key-value
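Patrick's top-/bottom-N idea avoids sorting all 3MM coefficients: with NumPy, `np.argpartition` selects the N extremes in linear time. A local single-machine sketch (toy sizes, not a Spark job):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((200, 1000))  # 200 samples x 1000 columns (toy)
target = data[:, 0]

# Correlate the target column against every column in one vectorized pass.
d = data - data.mean(axis=0)
tc = target - target.mean()
coeffs = (d.T @ tc) / (np.linalg.norm(d, axis=0) * np.linalg.norm(tc))

N = 10
top = np.argpartition(coeffs, -N)[-N:]   # indices of the N largest coefficients
bottom = np.argpartition(coeffs, N)[:N]  # indices of the N smallest

print(sorted(top), sorted(bottom))
```

Partial selection like this is O(columns) per query, versus O(columns log columns) for a full sort, which matters when only the extremes are returned to the user.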

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
Thanks for the reply, Bobby. I’ve received notice that we can probably tolerate response times of up to 30 seconds. Would this be more manageable? 5 seconds was an initial ask, but 20-30 seconds is also a reasonable response time for our use case. With the new SLA, do you think that we can

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Bobby Evans
Let's do a few quick rules of thumb to get an idea of what kind of processing power you will need in general to do what you want. You need 3,000,000 ints by 50,000 rows. Each int is 4 bytes, so that ends up being about 560 GiB that you need to fully process in 5 seconds. If you are reading this
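Bobby's back-of-envelope can be reproduced directly; the 5-second window then implies the aggregate scan bandwidth the cluster must sustain just to touch every byte once:

```python
cols = 3_000_000   # ints per row
rows = 50_000
bytes_per_int = 4

total_bytes = cols * rows * bytes_per_int   # 600 GB decimal
gib = total_bytes / 2**30                   # ~558.8 GiB, i.e. "about 560"
bandwidth = total_bytes / 5 / 2**30         # GiB/s needed to scan it in 5 s

print(round(gib, 1), round(bandwidth, 1))
```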

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-11 Thread Steven Stetzler
Hi Gautham, I am a beginner Spark user too and I may not have a complete understanding of your question, but I thought I would start a discussion anyway. Have you looked into using Spark's built-in Correlation function (https://spark.apache.org/docs/latest/ml-statistics.html)? This might let you

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-11 Thread Gautham Acharya
Ping? I would really appreciate advice on this! Thank you! From: Gautham Acharya Sent: Tuesday, July 9, 2019 4:22 PM To: user@spark.apache.org Subject: [Beginner] Run compute on large matrices and return the result in seconds? This is my first email to this mailing list, so I apologize if I

[Beginner] Run compute on large matrices and return the result in seconds?

2019-07-09 Thread Gautham Acharya
This is my first email to this mailing list, so I apologize if I made any errors. My team is going to be building an application, and I'm investigating some options for distributed compute systems. We want to perform computations on large matrices. The requirements are as follows: 1.