Hi Artemis,

no one, and I repeat no one, is monopolising the data science market. In
fact, almost all of the algorithms, code, and papers are available for free,
with the largest open-source contributions coming from Amazon, Google, and
Microsoft (Azure), the very companies you say are trying to monopolise the
market.

I think we owe a lot to these large corporations, who spent billions and then
open-sourced their products. On this mailing list, of which I am one of the
oldest members, you will receive responses from Matei Zaharia, Reynold Xin,
Burak, TD, Michael Armbrust, and so on.

I personally feel fortunate to be part of such a group. They are still the
founders of Databricks, which is a for-profit company, yet the innovations
from Databricks are eventually given away for free through open-source
projects led by Databricks employees.

Let us please be grateful and acknowledge their generosity where possible. I
am sure we will all find the help that we seek, but that help will most
likely come from people who are paid and supported by the very companies
towards whom you are being so unkind.


Regards,
Gourav Sengupta




On Fri, Jul 30, 2021 at 4:02 PM Artemis User <arte...@dtechspace.com> wrote:

> Thanks Gourav for the info.  Actually I am looking for concrete
> experiences and detailed best practices from people who have built their
> own GPU-powered environment instead of relying on the big cloud providers who
> are dominating and trying to monopolize the data science market....
>
> -- ND
>
> On 7/30/21 4:37 AM, Gourav Sengupta wrote:
>
> Hi,
>
> there are no real cons to using Spark with GPUs; you just have to be careful
> about GPU memory and a few other details.
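>
> For instance, here is a minimal sketch of the kind of configuration
> involved (the property names are standard Spark 3.x resource scheduling and
> spark-rapids options, but the values are illustrative assumptions; the
> RAPIDS jars must already be available on the cluster, e.g. via the EMR GPU
> setup, and some properties may need to be set at submit time rather than in
> the session builder):
>
>     from pyspark.sql import SparkSession
>
>     # Enable the RAPIDS SQL plugin and reserve GPU resources per executor.
>     # The GPU shares and pinned-pool size below are illustrative only and
>     # should be tuned to the GPU memory actually available.
>     spark = (
>         SparkSession.builder
>         .appName("rapids-sketch")
>         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
>         .config("spark.rapids.sql.enabled", "true")
>         .config("spark.executor.resource.gpu.amount", "1")
>         .config("spark.task.resource.gpu.amount", "0.25")
>         .config("spark.rapids.memory.pinnedPool.size", "2G")
>         .getOrCreate()
>     )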
>
> I have sometimes seen a 10x improvement over baseline Spark 3.x performance,
> and sometimes around 30x.
>
> Not all queries will be faster on GPUs, so it is up to you to test the
> scenarios specific to your workload. I use EMR for this, and it is really
> impressive what the NVIDIA folks have done.
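>
> One way to see which parts of a plan the plugin will (or will not) run on
> the GPU is the spark-rapids explain setting; a small sketch, where the
> input path and column names are made up:
>
>     # Ask the plugin to log operations that will NOT be placed on the GPU.
>     spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")
>
>     df = spark.read.parquet("s3://my-bucket/events/")   # hypothetical input
>     agg = df.groupBy("user_id").count()
>
>     # Operators taken over by the plugin show up as Gpu* nodes in the
>     # physical plan (e.g. GpuHashAggregate); anything else falls back to CPU.
>     agg.explain()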
>
> I think there was an initial promise around the Spark 3.x release that Spark
> dataframes could be handed directly, through native integration, to
> TensorFlow and other frameworks, which would be a brilliant way forward for
> Spark, but I think the Spark project leaders are yet to prioritise it.
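>
> Absent that native path, the usual workaround today is to collect the data
> to the driver and hand it to tf.data, which only works for modest data
> sizes; a rough sketch, assuming the spark session from the earlier sketch
> and made-up column names:
>
>     import tensorflow as tf
>
>     # Tiny illustrative DataFrame; real feature columns would come from an
>     # actual table. Collecting to pandas pulls everything onto the driver.
>     pdf = spark.createDataFrame(
>         [(1.0, 2.0, 0), (3.0, 4.0, 1)],
>         ["feature_a", "feature_b", "label"],
>     ).toPandas()
>
>     ds = tf.data.Dataset.from_tensor_slices(
>         (pdf[["feature_a", "feature_b"]].values, pdf["label"].values)
>     )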
>
> Also Ray, another project out of Berkeley, is trying to make Spark
> dataframes transferable to TensorFlow. Clearly, if Spark users rely on Ray
> to move Spark dataframes to TensorFlow and other frameworks, then Ray will
> see massive adoption.
>
> Personally, I think the Spark community could have built this integration
> with other frameworks natively, given the fantastic contributions by NVIDIA
> to Spark and such a large, active development community; but of course Ray
> has to win as well, and there is nothing better than riding on the success
> of Spark. I may be wrong, though, and the Spark community may still be
> developing those integrations.
>
>
> Regards,
> Gourav Sengupta
>
>
> On Fri, Jul 30, 2021 at 2:46 AM Artemis User <arte...@dtechspace.com>
> wrote:
>
>> Has anyone had any experience with running Spark-Rapids on a GPU-powered
>> cluster (https://github.com/NVIDIA/spark-rapids)?  I am very interested
>> in knowing:
>>
>>    1. What is the hardware/software platform and the type of Spark
>>    cluster you are using to run Spark-Rapids?
>>    2. How easy was the installation process?
>>    3. Are you running Scala or PySpark or both with Spark-Rapids?
>>    4. How does the performance you've seen compare with running a
>>    CPU-only cluster?
>>    5. Any pros/cons of using Spark-Rapids?
>>
>> Thanks a lot in advance!
>>
>> -- ND
>>
>
>
