Users can also request random rows in those columns. So a user can request a
subset of the matrix (N rows and N columns), which would change the values of
the correlation coefficients.
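As a toy illustration of why subsetting invalidates precomputed coefficients (a minimal pure-Python sketch; the data is synthetic and far smaller than the real matrices):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 10]
y = [1, 2, 3, 4, -10]

# Full column pair: the last row drags the coefficient strongly negative.
r_full = pearson(x, y)
# Subset dropping the last row: the same columns are perfectly correlated.
r_subset = pearson(x[:4], y[:4])
```

Because any row subset can flip the coefficient like this, every distinct subset needs its own computation.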
From: Jerry Vinokurov [mailto:grapesmo...@gmail.com]
Sent: Wednesday, July 17, 2019 1:27 PM
To:
Maybe I'm not understanding something about this use case, but why is
precomputation not an option? Is it because the matrices themselves change?
Because if the matrices are constant, then I think precomputation would
work for you even if the users request random correlations. You can just
store
As I said in my initial message, precomputing is not an option.
Retrieving only the top/bottom N most correlated is an option – would that
speed up the results?
Our SLAs are soft – slight variations (±15 seconds) will not cause issues.
--gautham
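If only the top/bottom N matter, the full 3MM-coefficient result set never has to leave the cluster. A sketch with the standard library's heapq (the column names and coefficient map are hypothetical stand-ins):

```python
import heapq
import random

random.seed(0)
# Stand-in for the computed coefficients: one value per column.
coeffs = {f"col_{i}": random.uniform(-1.0, 1.0) for i in range(100_000)}

N = 10
# O(M log N) selection; only 2N (column, coefficient) pairs are returned.
top_n = heapq.nlargest(N, coeffs.items(), key=lambda kv: kv[1])
bottom_n = heapq.nsmallest(N, coeffs.items(), key=lambda kv: kv[1])
```

This shrinks the response payload, but the correlations themselves still have to be computed first, so it mainly saves serialization and transfer time.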
From: Patrick McCarthy
Do you really need the results of all 3MM computations, or only the top-
and bottom-most correlation coefficients? Could correlations be computed on
a sample and from that estimate a distribution of coefficients? Would it
make sense to precompute offline and instead focus on fast key-value
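The sampling idea can be sketched in pure Python (synthetic, linearly related columns; the sample size of 1,000 is an arbitrary choice):

```python
import math
import random

random.seed(1)
rows = 50_000
# Synthetic pair of columns: y is a noisy linear function of x,
# so the true correlation is 2 / sqrt(5), about 0.894.
x = [random.gauss(0, 1) for _ in range(rows)]
y = [2 * a + random.gauss(0, 1) for a in x]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Estimate from a 2% row sample instead of all 50,000 rows.
idx = random.sample(range(rows), 1_000)
r_sample = pearson([x[i] for i in idx], [y[i] for i in idx])
r_full = pearson(x, y)
```

The sample estimate lands within a few hundredths of the full-data coefficient at a fraction of the cost, which may be enough to rank columns before computing exact values for the winners.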
Thanks for the reply, Bobby.
I’ve received notice that we can probably tolerate response times of up to 30
seconds. Would this be more manageable? 5 seconds was an initial ask, but 20-30
seconds is also a reasonable response time for our use case.
With the new SLA, do you think that we can
Let's do a few quick rules of thumb to get an idea of what kind of
processing power you will need in general to do what you want.
You have a matrix of 3,000,000 ints by 50,000 rows. Each int is 4 bytes, so
that ends up being about 560 GB that you need to fully process in 5 seconds.
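That arithmetic checks out (a quick sketch; the matrix shape and 4-byte ints are from the thread):

```python
cols = 3_000_000   # correlation columns
rows = 50_000
bytes_per_int = 4

total_bytes = cols * rows * bytes_per_int   # 600,000,000,000 bytes
total_gib = total_bytes / 2**30             # ~559 GiB, i.e. "about 560 GB"

# Sequential-scan bandwidth needed to touch every byte within the SLA:
for sla_seconds in (5, 30):
    print(f"{sla_seconds}s SLA -> {total_gib / sla_seconds:.0f} GiB/s")
```

Even the relaxed 30-second SLA implies roughly 19 GiB/s of effective scan bandwidth, which points toward keeping the data cached in memory across many executors.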
If you are reading this
Hi Gautham,
I am a beginner Spark user too and I may not have a complete understanding
of your question, but I thought I would start a discussion anyway. Have you
looked into using Spark's built-in Correlation function? (
https://spark.apache.org/docs/latest/ml-statistics.html) This might let you
Ping? I would really appreciate advice on this! Thank you!
From: Gautham Acharya
Sent: Tuesday, July 9, 2019 4:22 PM
To: user@spark.apache.org
Subject: [Beginner] Run compute on large matrices and return the result in
seconds?
This is my first email to this mailing list, so I apologize if I