[jira] [Commented] (SPARK-32411) GPU Cluster Fail

2020-07-23 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163742#comment-17163742
 ] 

L. C. Hsieh commented on SPARK-32411:
-

I think it is because of the configs.

"spark.task.resource.gpu.amount 2" means each task requires 2 GPUs, but 
"spark.executor.resource.gpu.amount 1" specifies that each executor has only 
1 GPU, so the task scheduler cannot find an executor that satisfies the 
task's requirement.
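For example, a minimal consistent configuration (a sketch only; the amounts 
are illustrative and the script path is a placeholder) keeps the per-task 
amount at or below the per-executor amount:

{code:java}
# Each executor advertises 1 GPU and each task requests 1 GPU,
# so any single executor can satisfy a task's requirement.
spark.executor.resource.gpu.amount 1
spark.task.resource.gpu.amount 1
spark.executor.resource.gpu.discoveryScript /usr/local/spark/getGpusResources.sh
{code}

Fractional task amounts (e.g. the 0.25 in the reporter's first snippet) also 
work and let several tasks share one GPU; the problem here is only that the 
task amount exceeds the executor amount.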

> GPU Cluster Fail
> 
>
> Key: SPARK-32411
> URL: https://issues.apache.org/jira/browse/SPARK-32411
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 3.0.0
> Environment: I have an Apache Spark 3.0 cluster consisting of machines 
> with multiple NVIDIA GPUs, and I connect my Jupyter notebook to the cluster 
> using PySpark.
>Reporter: Vinh Tran
>Priority: Major
>
> I'm having a difficult time getting a GPU cluster started on Apache Spark 
> 3.0. It was hard to find documentation on this, but I stumbled on an NVIDIA 
> GitHub page for RAPIDS, which suggested the following additional edits to 
> spark-defaults.conf:
> {code:java}
> spark.task.resource.gpu.amount 0.25
> spark.executor.resource.gpu.discoveryScript ./usr/local/spark/getGpusResources.sh
> {code}
> I have an Apache Spark 3.0 cluster consisting of machines with multiple 
> NVIDIA GPUs, and I connect my Jupyter notebook to the cluster using PySpark; 
> however, this results in the following error: 
> {code:java}
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: You must specify an amount for gpu
>   at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$parseResourceRequest$1(ResourceUtils.scala:142)
>   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
>   at 
> org.apache.spark.resource.ResourceUtils$.parseResourceRequest(ResourceUtils.scala:142)
>   at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$parseAllResourceRequests$1(ResourceUtils.scala:159)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:159)
>   at 
> org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:238)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>   at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> After this, I tried adding another line to the conf per the instructions, 
> which results in no errors; however, when I log in to the Web UI at 
> localhost:8080, the state under Running Applications remains WAITING.
> {code:java}
> spark.task.resource.gpu.amount  2
> spark.executor.resource.gpu.discoveryScript ./usr/local/spark/getGpusResources.sh
> spark.executor.resource.gpu.amount  1
> {code}
>  



[jira] [Commented] (SPARK-32411) GPU Cluster Fail

2020-08-09 Thread Chitral Verma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173944#comment-17173944
 ] 

Chitral Verma commented on SPARK-32411:
---

[~vinhdiesal] Were you able to resolve this issue?

I'm also facing the same issue; my Spark config is as below. The Spark 
session initializes, but no tasks execute, as they stay in a waiting state.

 
{code:python}
spark = SparkSession \
    .builder \
    .master("local") \
    .config("spark.ui.port", spark_ui_port) \
    .config("spark.jars", ",".join(jars)) \
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
    .config("spark.sql.shuffle.partitions", "10") \
    .config("spark.driver.resource.gpu.discoveryScript", "/content/sparkRapidsPlugin/getGpusResources.sh") \
    .config("spark.driver.resource.gpu.amount", "1") \
    .config("spark.rapids.memory.pinnedPool.size", "2G") \
    .getOrCreate()
{code}
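Note that this builder only sets driver-side GPU resources. If the intent is 
to run against a cluster rather than master("local"), the executor and task 
GPU amounts would, per the comment above, also need to be set consistently. 
A minimal sketch, assuming one GPU per executor (the master URL and paths are 
placeholders, not values from this report):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("spark://master-host:7077")                 # hypothetical cluster master
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.executor.resource.gpu.amount", "1")  # each executor has 1 GPU
    .config("spark.task.resource.gpu.amount", "1")      # each task needs 1 GPU (<= executor amount)
    .config("spark.executor.resource.gpu.discoveryScript",
            "/content/sparkRapidsPlugin/getGpusResources.sh")
    .getOrCreate())
{code}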

[jira] [Commented] (SPARK-32411) GPU Cluster Fail

2020-09-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194485#comment-17194485
 ] 

Thomas Graves commented on SPARK-32411:
---

[~chitralverma] if you are still having an issue, please file an issue at 
[https://github.com/NVIDIA/spark-rapids/issues]




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org