Do you really need the results of all 3MM computations, or only the top-
and bottom-most correlation coefficients? Could correlations be computed on
a sample and from that estimate a distribution of coefficients? Would it
make sense to precompute offline and instead focus on fast key-value
retrieval, like ElasticSearch or ScyllaDB?
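
On the sampling idea, a rough NumPy sketch of what I mean (the shapes, the
10,000-row sample size, and the percentile summary are all illustrative,
not a recommendation):

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in for the full matrix (real scale: ~3M rows x 50K columns).
    matrix = rng.integers(0, 100, size=(100_000, 500))
    target = matrix[0].astype(np.float64)

    # Correlate only a random sample of rows against the target row and
    # use that to estimate the distribution of coefficients.
    idx = rng.choice(matrix.shape[0], size=10_000, replace=False)
    sample = matrix[idx].astype(np.float64)

    t = (target - target.mean()) / target.std()
    mu = sample.mean(axis=1, keepdims=True)
    sd = sample.std(axis=1, keepdims=True)
    coeffs = ((sample - mu) / sd) @ t / t.size   # Pearson coefficients

    print(np.percentile(coeffs, [1, 50, 99]))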

Spark is a compute framework rather than a serving backend; I don't think
it's designed with retrieval SLAs in mind, and you may find those SLAs
difficult to maintain.

On Wed, Jul 17, 2019 at 3:14 PM Gautham Acharya <gauth...@alleninstitute.org>
wrote:

> Thanks for the reply, Bobby.
>
>
>
> I’ve received notice that we can probably tolerate response times of up to
> 30 seconds. Would this be more manageable? 5 seconds was an initial ask,
> but 20-30 seconds is also a reasonable response time for our use case.
>
>
>
> With the new SLA, do you think that we can easily perform this computation
> in spark?
>
> --gautham
>
>
>
> *From:* Bobby Evans [mailto:reva...@gmail.com]
> *Sent:* Wednesday, July 17, 2019 7:06 AM
> *To:* Steven Stetzler <steven.stetz...@gmail.com>
> *Cc:* Gautham Acharya <gauth...@alleninstitute.org>; user@spark.apache.org
> *Subject:* Re: [Beginner] Run compute on large matrices and return the
> result in seconds?
>
>
>
>
> Let's do a few quick rules of thumb to get an idea of what kind of
> processing power you will need in general to do what you want.
>
>
>
> You have 3,000,000 rows of 50,000 ints each.  Each int is 4 bytes, so that
> ends up being about 600 GB (roughly 560 GiB) that you need to fully process
> in 5 seconds.
>
>
>
> If you are reading this from spinning disks (which average about 80 MB/s)
> you would need at least 1,450 disks to just read the data in 5 seconds
> (that number can vary a lot depending on the storage format and your
> compression ratio).
>
> If you are reading the data over a network (let's say 10GigE even though
> in practice you cannot get that in the cloud easily) you would need about
> 90 NICs just to read the data in 5 seconds, again depending on the
> compression ratio this may be lower.
>
> If you assume you have a cluster where it all fits in main memory and you
> have cached all of the data (memory scan rates I have seen on most modern
> systems are somewhere between 12 and 16 GB/sec), you would need between 7
> and 10 machines just to read through the data once in 5 seconds. Spark also
> stores cached data compressed, so you might need fewer machines than that.
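>
> The arithmetic above, roughly, in Python (the throughput figures are just
> the rules of thumb quoted here, so treat the outputs as order-of-magnitude
> estimates):
>
>     rows, cols, bytes_per_int = 3_000_000, 50_000, 4
>     total_bytes = rows * cols * bytes_per_int            # ~600 GB
>     total_gib = total_bytes / 2**30                      # ~560 GiB
>     sla = 5                                              # seconds
>
>     disk_bps = 80e6         # spinning disk, ~80 MB/s
>     nic_bps = 10e9 / 8      # 10GigE, ~1.25 GB/s
>     mem_gib_s = (12, 16)    # in-memory scan rate per machine, GiB/s
>
>     disks = total_bytes / (disk_bps * sla)                  # ~1,500 disks
>     nics = total_bytes / (nic_bps * sla)                    # ~96 NICs
>     machines = [total_gib / (r * sla) for r in mem_gib_s]   # ~7-9 machines
>     print(round(disks), round(nics), [round(m) for m in machines])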
>
>
>
> All of these numbers are within what Spark should be able to handle, but a
> 5 second SLA is very tight for this amount of data.
>
>
>
> Can you make this work with Spark?  Probably. Does Spark have something
> built in that will make this fast and simple for you?  I doubt it; you have
> some very tight requirements and will likely have to write something custom
> to make it work the way you want (a rough sketch of one possible shape is
> below).
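>
> To give a flavor of what "custom" could look like, here is a very rough
> PySpark sketch that broadcasts the selected row and computes Pearson
> coefficients per partition with NumPy. Toy sizes; the RDD layout, the
> partition count, and every name in it are hypothetical.
>
>     import numpy as np
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("rowcorr-sketch").getOrCreate()
>     sc = spark.sparkContext
>
>     # Standardize the selected row once and broadcast it to the executors.
>     target = np.random.randint(0, 100, size=50).astype(np.float64)
>     t_b = sc.broadcast((target - target.mean()) / target.std())
>
>     # Stand-in data set of (row key, integer vector) pairs.
>     rows = sc.parallelize(
>         [(f"row_{i}", np.random.randint(0, 100, size=50))
>          for i in range(1_000)],
>         numSlices=8)
>
>     def corr_partition(it):
>         t = t_b.value
>         keys, vecs = [], []
>         for k, v in it:
>             keys.append(k)
>             vecs.append(v)
>         if not keys:
>             return
>         m = np.asarray(vecs, dtype=np.float64)
>         z = (m - m.mean(axis=1, keepdims=True)) / m.std(axis=1, keepdims=True)
>         # Pearson coefficient of every row in this partition vs. the target.
>         yield from zip(keys, (z @ t / t.size).tolist())
>
>     # Ten most-correlated row keys; takeOrdered handles the sort/collect.
>     top = rows.mapPartitions(corr_partition).takeOrdered(
>         10, key=lambda kv: -kv[1])
>     print(top)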
>
>
>
>
>
> On Thu, Jul 11, 2019 at 4:12 PM Steven Stetzler <steven.stetz...@gmail.com>
> wrote:
>
> Hi Gautham,
>
>
>
> I am a beginner Spark user too and I may not have a complete understanding
> of your question, but I thought I would start a discussion anyway. Have you
> looked into using Spark's built-in Correlation function? (
> https://spark.apache.org/docs/latest/ml-statistics.html
> )
> This might let you get what you want (per-row correlation against the same
> matrix) without having to deal with parallelizing the computation yourself;
> a rough sketch of that API is included below. Also, I think how quickly you
> can get your results is largely a data-access question rather than a
> question of how fast Spark is. As long as you can
> exploit data parallelism (i.e. you can partition up your data), Spark will
> give you a speedup. You can imagine that if you had a large machine with
> many cores and ~200 GB of RAM (e.g. an m5.12xlarge EC2 instance), you could
> fit your problem in main memory and perform your computation with
> thread-based parallelism. This might get your result relatively quickly.
> For a dedicated application with well-constrained memory and compute
> requirements, it might not be a bad option to do everything on one machine
> as well. Accessing an external database and distributing work over a large
> number of computers can add overhead that might be out of your control.
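>
> For reference, a minimal sketch of that built-in Correlation API on toy
> data. Note that it computes correlations between the entries of a vector
> column, so the per-row use case would need the data laid out accordingly.
>
>     from pyspark.ml.linalg import Vectors
>     from pyspark.ml.stat import Correlation
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("corr-sketch").getOrCreate()
>     data = [(Vectors.dense([1.0, 0.0, 3.0]),),
>             (Vectors.dense([2.0, 5.0, 1.0]),),
>             (Vectors.dense([4.0, 2.0, 8.0]),)]
>     df = spark.createDataFrame(data, ["features"])
>
>     # Returns a DataFrame whose single row holds the correlation matrix.
>     pearson = Correlation.corr(df, "features", "pearson").head()[0]
>     print(pearson.toArray())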
>
>
>
> Thanks,
>
> Steven
>
>
>
> On Thu, Jul 11, 2019 at 9:24 AM Gautham Acharya <
> gauth...@alleninstitute.org> wrote:
>
> Ping? I would really appreciate advice on this! Thank you!
>
>
>
> *From:* Gautham Acharya
> *Sent:* Tuesday, July 9, 2019 4:22 PM
> *To:* user@spark.apache.org
> *Subject:* [Beginner] Run compute on large matrices and return the result
> in seconds?
>
>
>
> This is my first email to this mailing list, so I apologize if I made any
> errors.
>
>
>
> My team is going to be building an application, and I'm investigating some
> options for distributed compute systems. We want to perform computations on
> large matrices.
>
>
>
> The requirements are as follows:
>
>
>
> 1.     The matrices can be expected to be up to 50,000 columns x 3
> million rows. The values are all integers (except for the row/column
> headers).
>
> 2.     The application needs to select a specific row and calculate the
> correlation coefficient (
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
> ) against every other row. This means up to 3 million separate calculations
> (see the small pandas sketch after this list).
>
> 3.     A sorted list of the correlation coefficients and their
> corresponding row keys needs to be returned in under 5 seconds.
>
> 4.     Users will eventually request random row/column subsets to run
> calculations on, so precomputing our coefficients is not an option. This
> needs to be done on request.
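>
> For concreteness, here is a small pandas sketch of the computation in
> requirement 2 (toy shapes; the row labels and sizes are only illustrative):
>
>     import numpy as np
>     import pandas as pd
>
>     # Stand-in matrix: the real one is ~3 million rows x 50,000 columns.
>     df = pd.DataFrame(np.random.randint(0, 100, size=(1_000, 200)),
>                       index=[f"row_{i}" for i in range(1_000)])
>
>     target_key = "row_0"
>     # Pearson correlation of the selected row against every other row,
>     # sorted -- this is what needs to come back in seconds at full scale.
>     coeffs = df.T.corrwith(df.loc[target_key]).sort_values(ascending=False)
>     print(coeffs.head())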
>
>
>
> I've been looking at many compute solutions, but I'd consider Spark first
> due to the widespread use and community. I currently have my data loaded
> into Apache HBase for a different scenario (random access of rows/columns).
> I've naively tried loading a dataframe from the CSV using a Spark instance
> hosted on AWS EMR, but getting the results for even a single correlation
> takes over 20 seconds.
>
>
>
> Thank you!
>
>
>
>
>
> --gautham
>
>
>
>

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016
