[ https://issues.apache.org/jira/browse/SPARK-5743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guillaume Charhon updated SPARK-5743:
-------------------------------------
    Description: 
I am training a model with a RandomForest using this code snippet:

from pyspark.mllib.tree import RandomForest

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=5, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4)


The trainingData RDD looks like this:
trainingData.take(2)
[LabeledPoint(0.0, [0.02,0.0,0.05,0.02,0.0,0.05,18.0,123675.0,189.09]), 
LabeledPoint(0.0, [0.04,0.1,0.1,0.04,0.1,0.1,4.0,631180.0,157.36])]
which seems ok. 

The CSV file is 30 MB before loading.
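
For reference, here is a minimal sketch of how an RDD like this can be built from the CSV (the path and the parse_row helper below are illustrative, not taken from the report; the column layout is assumed to be the label first, then nine numeric features):

from pyspark.mllib.regression import LabeledPoint

def parse_row(line):
    # Assumed layout: label in column 0, nine numeric features after it.
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

# 'sc' is the usual SparkContext; the path is a placeholder.
trainingData = sc.textFile("data/training.csv").map(parse_row)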

When I run this snippet, I get the following error:
15/02/11 14:31:46 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
        at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:143)
        at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:461)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:259)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:255)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:252)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:252)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:252)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:252)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:163)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:127)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
        at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.aroundReceive(CoarseGrainedSchedulerBackend.scala:72)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
15/02/11 14:31:46 INFO DAGScheduler: Job 2 failed: count at DataValidators.scala:38, took 4.793047 s



PS: I am using Spark 1.2.0 with Python on machines with 8 cores and 32 GB of RAM each, with 1 master and 4 worker nodes. I have also tried with up to 30 machines with 16 cores and 104 GB of RAM each.
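
Note that the OutOfMemoryError is thrown on the driver (in Task$.serializeWithDependencies inside the scheduler), not on the executors, so the large worker machines do not help here. Raising the driver heap might work around it; a minimal sketch, assuming the context is built in the script itself (the app name and the 4g value are only examples):

from pyspark import SparkConf, SparkContext

# spark.driver.memory only takes effect if it is set before the driver JVM
# starts; when launching through spark-submit, pass --driver-memory instead.
conf = SparkConf().setAppName("rf-training").set("spark.driver.memory", "4g")
sc = SparkContext(conf=conf)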



> java.lang.OutOfMemoryError: Java heap space with RandomForest
> -------------------------------------------------------------
>
>                 Key: SPARK-5743
>                 URL: https://issues.apache.org/jira/browse/SPARK-5743
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.0
>         Environment: Debian 7 running on Google Compute Engine
>            Reporter: Guillaume Charhon
>


