GPU job in Spark 3

2021-04-09 Thread Martin Somers
Hi Everyone !!

I'm trying to get an on-premise GPU instance of Spark 3 running on my Ubuntu
box, and I am following:
https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#example-join-operation

Does anyone have any insight into why the Spark job isn't being run on the
GPU? It appears to run entirely on the CPU. The Hadoop binary is installed and
appears to be functioning fine:

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

Here is my setup on Ubuntu 20.10:


▶ nvidia-smi

NVIDIA-SMI 460.39    Driver Version: 460.39    CUDA Version: 11.2

GPU 0: GeForce RTX 3090 | Persistence-M: Off | Bus-Id :21:00.0 | Disp.A: On | Volatile Uncorr. ECC: N/A
Fan 0% | Temp 38C | Perf P8 | Pwr:Usage/Cap 19W / 370W | Memory-Usage 478MiB / 24265MiB
GPU-Util 0% | Compute M. Default | MIG M. N/A

/opt/sparkRapidsPlugin


▶ ls
cudf-0.18.1-cuda11.jar  getGpusResources.sh  rapids-4-spark_2.12-0.4.1.jar

▶ scalac --version
Scala compiler version 2.13.0 -- Copyright 2002-2019, LAMP/EPFL and
Lightbend, Inc.


▶ spark-shell --version
2021-04-09 17:05:36,158 WARN util.Utils: Your hostname, studio resolves to
a loopback address: 127.0.1.1; using 192.168.0.221 instead (on interface
wlp71s0)
2021-04-09 17:05:36,159 WARN util.Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
(file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations
WARNING: All illegal access operations will be denied in a future release
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.10
Branch HEAD
Compiled by user ubuntu on 2021-02-22T01:04:02Z
Revision 1d550c4e90275ab418b9161925049239227f3dc9
Url https://github.com/apache/spark
Type --help for more information.


Here is how I am calling spark-shell prior to adding the test job:

$SPARK_HOME/bin/spark-shell \
   --master local \
   --num-executors 1 \
   --conf spark.executor.cores=16 \
   --conf spark.rapids.sql.concurrentGpuTasks=1 \
   --driver-memory 10g \
   --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} \
   --conf spark.rapids.memory.pinnedPool.size=16G \
   --conf spark.locality.wait=0s \
   --conf spark.sql.files.maxPartitionBytes=512m \
   --conf spark.sql.shuffle.partitions=10 \
   --conf spark.plugins=com.nvidia.spark.SQLPlugin \
   --files $SPARK_RAPIDS_DIR/getGpusResources.sh \
   --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}
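
To rule out the launch options themselves, here are a couple of quick sanity
checks I can run from inside the shell - just a sketch (sc.getConf.get throws
if a key was never set):

// check that the options above actually reached the SparkConf
sc.getConf.get("spark.plugins")                        // expect com.nvidia.spark.SQLPlugin
sc.getConf.get("spark.rapids.sql.concurrentGpuTasks")  // expect 1
sc.getConf.get("spark.executor.extraClassPath")        // should name both jars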


The test job is from the example join operation in the guide above:

val df = sc.makeRDD(1 to 1000, 6).toDF
val df2 = sc.makeRDD(1 to 1000, 6).toDF
df.select($"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b").count
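
When the RAPIDS plugin is active, the physical plan should contain Gpu*
operators (e.g. GpuShuffledHashJoin) instead of the CPU versions, so one thing
I can check is the plan itself - a sketch (the plugin also documents a
spark.rapids.sql.explain setting for logging why operators stay on the CPU):

// same join as above, but inspect the plan before running it
val joined = df.select($"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b")
joined.explain()   // look for Gpu* nodes here
joined.count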


I just noticed that the Scala versions are out of sync (scalac 2.13.0 on my
box vs 2.12.10 bundled with Spark) - that shouldn't affect it, should it?


Is there anything else I can try in the --conf settings, or are there any logs
that would show what might be failing behind the scenes? Any suggestions?


Thanks
Martin


-- 
M


DCOS - s3

2016-08-21 Thread Martin Somers
I'm having trouble loading data from an S3 repo.
Currently DCOS is running Spark 2, so I'm not sure if the code needs to be
modified for the upgrade.

My code at the moment looks like this:


sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxx")

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val fname = "s3n://somespark/datain.csv"
  // val rows = sc.textFile(fname).map { line =>
  // val values = line.split(',').map(_.toDouble)
  // Vectors.dense(values)
  // }


  val rows = sc.textFile(fname)
  rows.count()
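
For reference, here is a sketch of the same read done the Spark 2 way, via
SparkSession and the s3a connector (this assumes the hadoop-aws jar and a
matching AWS SDK are on the classpath - I have not verified that on DCOS):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-read").getOrCreate()
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "xxx")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "xxx")

// same file as above, just addressed through s3a
val df = spark.read.csv("s3a://somespark/datain.csv")
df.count()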



The Spark service returns a failed message for the s3n job above, but gives
little information as to exactly why it didn't run.


Any suggestions as to what I can try?
-- 
M


UNSUBSCRIBE

2016-08-09 Thread Martin Somers
-- 
M


Unsubscribe.

2016-08-09 Thread Martin Somers
Unsubscribe.

Thanks
M


Re: libraryDependencies

2016-07-26 Thread Martin Somers
Cheers - I updated it to:

libraryDependencies ++= Seq(
  // other dependencies here
  "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-mllib_2.10" % "1.6.2",
  "org.scalanlp" %% "breeze" % "0.12",
  // native libraries are not included by default. add this if you want them (as of 0.7)
  // native libraries greatly improve performance, but increase jar sizes.
  "org.scalanlp" %% "breeze-natives" % "0.12"
)

and I am getting a similar error:

Compiling 1 Scala source to
/Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/target/scala-2.11/classes...
[error]
/Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:2:
object mllib is not a member of package org.apache.spark
[error] import org.apache.spark.mllib.linalg.distributed.RowMatrix
[error] ^
[error]
/Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:3:
object mllib is not a member of package org.apache.spark
[error] import org.apache.spark.mllib.linalg.SingularValueDecomposition
[error] ^
[error]
/Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:5:
object mllib is not a member of package org.apache.spark
[error] import org.apache.spark.mllib.linalg.{Vector, Vectors}
[error] ^
[error]
/Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:8:
not found: object breeze
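
If I apply Michael's advice consistently (use %% and drop the explicit Scala
suffix, with matching Spark versions), I think the block should look roughly
like this - a sketch I have not verified yet:

libraryDependencies ++= Seq(
  // %% appends the project's Scala version automatically, so no _2.10 / _2.11 here
  "org.apache.spark" %% "spark-core"  % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.2",
  "org.scalanlp" %% "breeze" % "0.12",
  "org.scalanlp" %% "breeze-natives" % "0.12"
)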

On Tue, Jul 26, 2016 at 8:36 PM, Michael Armbrust wrote:

> Also, you'll want all of the various spark versions to be the same.
>
> On Tue, Jul 26, 2016 at 12:34 PM, Michael Armbrust wrote:
>
>> If you are using %% (double) then you do not need _2.11.
>>
>> On Tue, Jul 26, 2016 at 12:18 PM, Martin Somers wrote:
>>
>>>
>>> my build file looks like
>>>
>>> libraryDependencies  ++= Seq(
>>>   // other dependencies here
>>>   "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
>>>   "org.apache.spark" %% "spark-mllib_2.11" % "1.6.0",
>>>   "org.scalanlp" % "breeze_2.11" % "0.7",
>>>   // native libraries are not included by default. add this
>>> if you want them (as of 0.7)
>>>   // native libraries greatly improve performance, but
>>> increase jar sizes.
>>>   "org.scalanlp" % "breeze-natives_2.11" % "0.7",
>>> )
>>>
>>> not 100% sure on the version numbers if they are indeed correct
>>> getting an error of
>>>
>>> [info] Resolving jline#jline;2.12.1 ...
>>> [info] Done updating.
>>> [info] Compiling 1 Scala source to
>>> /Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/target/scala-2.11/classes...
>>> [error]
>>> /Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:2:
>>> object mllib is not a member of package org.apache.spark
>>> [error] import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>> 
>>> ...
>>>
>>>
>>> Im trying to import in
>>>
>>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>> import org.apache.spark.mllib.linalg.SingularValueDecomposition
>>>
>>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>>>
>>>
>>> import breeze.linalg._
>>> import breeze.linalg.{ Matrix => B_Matrix }
>>> import breeze.linalg.{ Vector => B_Matrix }
>>> import breeze.linalg.DenseMatrix
>>>
>>> object MyApp {
>>>   def main(args: Array[String]): Unit = {
>>> //code here
>>> }
>>>
>>>
>>> It might not be the correct way of doing this
>>>
>>> Anyone got any suggestion
>>> tks
>>> M
>>>
>>>
>>>
>>>
>>
>


-- 
M


libraryDependencies

2016-07-26 Thread Martin Somers
My build file looks like this:

libraryDependencies ++= Seq(
  // other dependencies here
  "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-mllib_2.11" % "1.6.0",
  "org.scalanlp" % "breeze_2.11" % "0.7",
  // native libraries are not included by default. add this if you want them (as of 0.7)
  // native libraries greatly improve performance, but increase jar sizes.
  "org.scalanlp" % "breeze-natives_2.11" % "0.7"
)

I'm not 100% sure the version numbers are correct. I'm getting this error:

[info] Resolving jline#jline;2.12.1 ...
[info] Done updating.
[info] Compiling 1 Scala source to
/Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/target/scala-2.11/classes...
[error]
/Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:2:
object mllib is not a member of package org.apache.spark
[error] import org.apache.spark.mllib.linalg.distributed.RowMatrix

...


I'm trying to import:

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.SingularValueDecomposition

import org.apache.spark.mllib.linalg.{Vector, Vectors}


import breeze.linalg._
import breeze.linalg.{ Matrix => B_Matrix }
import breeze.linalg.{ Vector => B_Vector }
import breeze.linalg.DenseMatrix

object MyApp {
  def main(args: Array[String]): Unit = {
    // code here
  }
}


This might not be the correct way of doing it.
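
For comparison, here is a minimal end-to-end sketch of the kind of thing I am
aiming at, using a small hard-coded 3x3 matrix (assuming the Spark 1.6 mllib
API and the dependencies above resolve):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("svd-sample")
    val sc = new SparkContext(conf)

    // a tiny 3x3 matrix as an RDD of dense rows
    val rows = sc.parallelize(Seq(
      Vectors.dense(2.5903, -0.0416, 0.6023),
      Vectors.dense(-0.1236, 2.5596, 0.7629),
      Vectors.dense(0.0148, -0.0693, 0.2490)))

    val mat = new RowMatrix(rows)
    val svd = mat.computeSVD(mat.numCols().toInt, computeU = true)
    println(svd.s)   // singular values

    sc.stop()
  }
}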

Does anyone have any suggestions?
tks
M


sbt build under scala

2016-07-26 Thread Martin Somers
Just wondering: what is the correct way of building a Spark job using Scala,
and are there any changes coming with Spark v2?

I've been following this post:

http://www.infoobjects.com/spark-submit-with-sbt/



Then again, I've been mainly using Docker locally - what is a decent container
for submitting these jobs locally?

I'm getting to a stage where I need to submit jobs remotely and am thinking
about the best way of doing so.
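
For concreteness, a rough sketch of the kind of build.sbt I have in mind
(versions are my assumptions; "provided" keeps Spark itself out of the jar
that gets submitted, and sbt-assembly can produce a fat jar if other
dependencies need to ship with it):

// minimal build.sbt for a Spark 2.0 job
name := "spark-sample"
version := "0.1"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided"
)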


tks

M


SVD output within Spark

2016-07-21 Thread Martin Somers
I'm just looking at a comparison between Matlab and Spark for SVD with an
input matrix N.


This is the Matlab output - yes, a very small matrix:

N =

    2.5903   -0.0416    0.6023
   -0.1236    2.5596    0.7629
    0.0148   -0.0693    0.2490


U =

   -0.3706   -0.9284    0.0273
   -0.9287    0.3708    0.0014
   -0.0114   -0.0248   -0.9996


Spark code

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Breeze to Spark: Breeze matrices are stored column-major, so
// N.reshape(1, 9).toArray yields the entries in column order.
val N1D = N.reshape(1, 9).toArray

// Note: I had to transpose the array to get the correct values (with incorrect signs);
// grouped(3) produces the columns, and transpose recovers the rows.
val V2D = N1D.grouped(3).toArray.transpose

// Then convert the array into an RDD and build a distributed RowMatrix
// val NVecdis = Vectors.dense(N1D.map(x => x.toDouble))
// val V2D = N1D.grouped(3).toArray
val rowlocal = V2D.map { x => Vectors.dense(x) }
val rows = sc.parallelize(rowlocal)
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(mat.numCols().toInt, computeU = true)



Spark output - notice the change of sign in the 2nd and 3rd columns:
-0.3158590633523746   0.9220516369164243   -0.22372713505049768
-0.8822050381939436   -0.3721920780944116  -0.28842213436035985
-0.34920956843045253  0.10627246051309004  0.9309988407367168



And finally some Julia code:

N = [ 2.59031   -0.0416335   0.602295;
     -0.123584   2.55964     0.762906;
      0.0148463  -0.0693119  0.249017]

svd(N, thin=true)   # same as Matlab:
-0.315859  -0.922052   0.223727
-0.882205   0.372192   0.288422
-0.34921   -0.106272  -0.930999
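
One thing worth remembering is that the SVD is only unique up to the sign of
each singular vector, so before comparing U across Matlab, Julia and Spark it
helps to fix a sign convention. A sketch in plain Scala on the collected rows
of U (the helper name is mine):

// flip each column of U so that its largest-magnitude entry is positive
def normalizeColumnSigns(uRows: Array[Array[Double]]): Array[Array[Double]] = {
  val cols = uRows.head.length
  val flips = (0 until cols).map { j =>
    val col = uRows.map(_(j))
    val iMax = col.map(math.abs).zipWithIndex.maxBy(_._1)._2
    if (col(iMax) < 0) -1.0 else 1.0
  }
  uRows.map(row => row.zip(flips).map { case (v, f) => v * f })
}

// e.g. normalizeColumnSigns(svd.U.rows.map(_.toArray).collect())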

Most likely it's an issue with my implementation rather than a bug in SVD
within the Spark environment. My Spark instance is running locally in a Docker
container.

Any suggestions?
tks