Hey Martin,
I would encourage you to file issues in the spark-rapids repo for questions 
with that plugin: https://github.com/NVIDIA/spark-rapids/issues
I'm assuming the query ran and you looked at the SQL UI or the .explain() 
output and saw it ran on the CPU rather than the GPU?  I am also assuming you 
have the CUDA 11.0 runtime installed (look in /usr/local). You printed the 
driver version, which is 11.2, but the runtime version can be different. You 
are using the CUDA 11.0 build of the cudf library; if that didn't match the 
runtime, though, it would have failed rather than run anything.
The easiest way to tell why it didn't run on the GPU is to enable the config: 
spark.rapids.sql.explain=NOT_ON_GPU 
It will print logs to your console explaining why different operators didn't 
run on the GPU.
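For example, you can either pass it when launching the shell or set it from an 
already-running session (a quick sketch; RAPIDS SQL configs like this one can 
generally be changed on a live session):

$SPARK_HOME/bin/spark-shell \
       ... \
       --conf spark.rapids.sql.explain=NOT_ON_GPU

scala> spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")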
Again, feel free to open a question issue in the spark-rapids repo and we can 
discuss more there.
Tom
    On Friday, April 9, 2021, 11:19:05 AM CDT, Martin Somers 
<sono...@gmail.com> wrote:  
 
 
Hi Everyone !!

I'm trying to get an on-premise GPU instance of Spark 3 running on my Ubuntu 
box, and I am following:  
https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#example-join-operation

Does anyone have any insight into why a Spark job isn't being run on the GPU? 
It appears to run entirely on the CPU. The Hadoop binary is installed and 
appears to be functioning fine:  

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
Here is my setup on Ubuntu 20.10:


▶ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:21:00.0  On |                  N/A |
|  0%   38C    P8    19W / 370W |    478MiB / 24265MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

/opt/sparkRapidsPlugin
▶ ls
cudf-0.18.1-cuda11.jar  getGpusResources.sh  rapids-4-spark_2.12-0.4.1.jar

▶ scalac --version
Scala compiler version 2.13.0 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, 
Inc.


▶ spark-shell --version
2021-04-09 17:05:36,158 WARN util.Utils: Your hostname, studio resolves to a 
loopback address: 127.0.1.1; using 192.168.0.221 instead (on interface wlp71s0)
2021-04-09 17:05:36,159 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind 
to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor 
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
                        
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.10
Branch HEAD
Compiled by user ubuntu on 2021-02-22T01:04:02Z
Revision 1d550c4e90275ab418b9161925049239227f3dc9
Url https://github.com/apache/spark
Type --help for more information.


Here is how I'm calling spark-shell prior to adding the test job:

$SPARK_HOME/bin/spark-shell \
       --master local \
       --num-executors 1 \
       --conf spark.executor.cores=16 \
       --conf spark.rapids.sql.concurrentGpuTasks=1 \
       --driver-memory 10g \
       --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} \
       --conf spark.rapids.memory.pinnedPool.size=16G \
       --conf spark.locality.wait=0s \
       --conf spark.sql.files.maxPartitionBytes=512m \
       --conf spark.sql.shuffle.partitions=10 \
       --conf spark.plugins=com.nvidia.spark.SQLPlugin \
       --files $SPARK_RAPIDS_DIR/getGpusResources.sh \
       --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}


The test job is from the example join operation:

val df = sc.makeRDD(1 to 10000000, 6).toDF
val df2 = sc.makeRDD(1 to 10000000, 6).toDF
df.select($"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b").count


I just noticed that the Scala versions are out of sync (scalac is 2.13.0, 
while this Spark build uses 2.12.10) - that shouldn't affect it, should it?


Is there anything else I can try in the --conf, or are there any logs to see 
what might be failing behind the scenes? Any suggestions?


Thanks
Martin

-- 
M  
