Re: Spark ML's RandomForestClassifier OOM
No. I am running Spark on YARN on a 3-node testing cluster. My guess is that, given the number of splits performed by a hundred trees of depth 30 (which should be more than 100 * 2^30), either the executors or the driver die of OOM while trying to store all the split metadata. I would guess that the same issue affects both local and distributed modes. But those are just conjectures.

--
Julio

> On 10 Jan 2017, at 11:22, Marco Mistroni <mmistr...@gmail.com> wrote:
>
> Are you running locally? I found exactly the same issue.
> Two solutions:
> - Reduce data size.
> - Run on EMR.
> HTH
>
>> On 10 Jan 2017 10:07 am, "Julio Antonio Soto" <ju...@esbet.es> wrote:
>> Hi,
>>
>> I am running into OOM problems while training a Spark ML
>> RandomForestClassifier (maxDepth of 30, 32 maxBins, 100 trees).
>>
>> My dataset is arguably pretty big given the executor count and size (8x5G),
>> with approximately 20M rows and 130 features.
>>
>> The "fun fact" is that a single DecisionTreeClassifier with the same specs
>> (same maxDepth and maxBins) is able to train without problems in a couple of
>> minutes.
>>
>> AFAIK the current random forest implementation grows each tree sequentially,
>> which means that DecisionTreeClassifiers are fit one by one, and therefore
>> the training process should be similar in terms of memory consumption. Am I
>> missing something here?
>>
>> Thanks
>> Julio
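[Editor's note: a back-of-envelope check on the conjecture above, in plain Scala (no Spark needed). A perfect binary tree of depth 30 has 2^31 - 1 nodes, so 100 such trees could in principle materialize on the order of 2 * 10^11 nodes — consistent with the "more than 100 * 2^30" estimate.]

```scala
// Worst-case node count for a perfect binary tree of depth d:
// 2^(d+1) - 1 nodes (internal nodes plus leaves).
def worstCaseNodes(depth: Int): Long = (1L << (depth + 1)) - 1

val perTree = worstCaseNodes(30) // 2^31 - 1 = 2147483647
val forest  = 100L * perTree     // ~2.1e11 potential nodes across 100 trees

println(s"per tree: $perTree, forest: $forest")
```

Even if most real trees stop splitting long before that bound, the metadata kept per candidate split grows quickly with depth, which fits the OOM symptom described.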
Spark ML's RandomForestClassifier OOM
Hi,

I am running into OOM problems while training a Spark ML RandomForestClassifier (maxDepth of 30, 32 maxBins, 100 trees).

My dataset is arguably pretty big given the executor count and size (8x5G), with approximately 20M rows and 130 features.

The "fun fact" is that a single DecisionTreeClassifier with the same specs (same maxDepth and maxBins) is able to train without problems in a couple of minutes.

AFAIK the current random forest implementation grows each tree sequentially, which means that DecisionTreeClassifiers are fit one by one, and therefore the training process should be similar in terms of memory consumption. Am I missing something here?

Thanks,
Julio
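[Editor's note: a minimal sketch of the setup described above, with assumed column names ("label", "features") and an assumed DataFrame `df`. One knob worth knowing about at these depths is maxMemoryInMB, which bounds the memory Spark ML uses to aggregate split statistics per iteration; its default is 256. Note also that 30 is the maximum maxDepth Spark ML accepts.]

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")          // assumed column name
  .setFeaturesCol("features")    // assumed column name
  .setNumTrees(100)
  .setMaxDepth(30)               // 30 is the hard upper limit in Spark ML
  .setMaxBins(32)
  .setMaxMemoryInMB(1024)        // default 256; bounds split-statistics aggregation

// val model = rf.fit(df)        // df: DataFrame with the columns above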
OOM on yarn-cluster mode
Hi,

I'm having trouble submitting Spark jobs in yarn-cluster mode. While the job works and completes in yarn-client mode, I hit the following error when using spark-submit in yarn-cluster (simplified):

16/01/19 21:43:31 INFO hive.metastore: Connected to metastore.
16/01/19 21:43:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/19 21:43:32 INFO session.SessionState: Created local directory: /yarn/nm/usercache/julio/appcache/application_1453120455858_0040/container_1453120455858_0040_01_01/tmp/77350a02-d900-4c84-9456-134305044d21_resources
16/01/19 21:43:32 INFO session.SessionState: Created HDFS directory: /tmp/hive/nobody/77350a02-d900-4c84-9456-134305044d21
16/01/19 21:43:32 INFO session.SessionState: Created local directory: /yarn/nm/usercache/julio/appcache/application_1453120455858_0040/container_1453120455858_0040_01_01/tmp/nobody/77350a02-d900-4c84-9456-134305044d21
16/01/19 21:43:32 INFO session.SessionState: Created HDFS directory: /tmp/hive/nobody/77350a02-d900-4c84-9456-134305044d21/_tmp_space.db
16/01/19 21:43:32 INFO parquet.ParquetRelation: Listing hdfs://namenode01:8020/user/julio/PFM/CDRs_parquet_np on driver
16/01/19 21:43:33 INFO spark.SparkContext: Starting job: table at code.scala:13
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Got job 0 (table at code.scala:13) with 8 output partitions
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (table at code.scala:13)
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Missing parents: List()
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at table at code.scala:13), which has no missing parents
Exception in thread "dag-scheduler-event-loop"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "dag-scheduler-event-loop"
Exception in thread "SparkListenerBus"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "SparkListenerBus"

It happens with whatever program I build, for example:

object MainClass {
  def main(args: Array[String]): Unit = {
    val conf = new org.apache.spark.SparkConf().setAppName("test")

    val sc = new org.apache.spark.SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    val rdd = sqlContext.read.table("cdrs_np")
      .na.drop(how = "any")
      .map(_.toSeq.map(y => y.toString))
      .map(x => (x.head, x.tail))

    rdd.saveAsTextFile(args(0))
  }
}

The command I'm using in spark-submit is the following:

spark-submit --master yarn \
  --deploy-mode cluster \
  --driver-memory 1G \
  --executor-memory 3000m \
  --executor-cores 1 \
  --num-executors 8 \
  --class MainClass \
  spark-yarn-cluster-test_2.10-0.1.jar \
  hdfs://namenode01/etl/test

I've got more than enough resources in my cluster to run the job (in fact, the exact same command works in --deploy-mode client).

I tried increasing yarn.app.mapreduce.am.resource.mb to 2 GB, but that didn't work. I guess there is another parameter I should tweak, but I have not found any information whatsoever on the Internet.

I'm running Spark 1.5.2 and YARN from Hadoop 2.6.0-cdh5.5.1.

Any help would be greatly appreciated!

Thank you.

--
Julio Antonio Soto de Vicente
Re: OOM on yarn-cluster mode
Hi,

I tried with --driver-memory 16G (more than enough to read a simple Parquet table), but the problem still persists. Everything works fine in yarn-client.

--
Julio Antonio Soto de Vicente

> On 19 Jan 2016, at 22:18, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> You could try increasing the driver memory with "--driver-memory"; it looks
> like the OOM came from the driver side, so the simple solution is to increase
> the memory of the driver.
>
>> On Tue, Jan 19, 2016 at 1:15 PM, Julio Antonio Soto <ju...@esbet.es> wrote:
>> Hi,
>>
>> I'm having trouble submitting Spark jobs in yarn-cluster mode. While the
>> job works and completes in yarn-client mode, I hit the following error when
>> using spark-submit in yarn-cluster (simplified):
>>
>> 16/01/19 21:43:31 INFO hive.metastore: Connected to metastore.
>> 16/01/19 21:43:32 WARN util.NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 16/01/19 21:43:32 INFO session.SessionState: Created local directory:
>> /yarn/nm/usercache/julio/appcache/application_1453120455858_0040/container_1453120455858_0040_01_01/tmp/77350a02-d900-4c84-9456-134305044d21_resources
>> 16/01/19 21:43:32 INFO session.SessionState: Created HDFS directory:
>> /tmp/hive/nobody/77350a02-d900-4c84-9456-134305044d21
>> 16/01/19 21:43:32 INFO session.SessionState: Created local directory:
>> /yarn/nm/usercache/julio/appcache/application_1453120455858_0040/container_1453120455858_0040_01_01/tmp/nobody/77350a02-d900-4c84-9456-134305044d21
>> 16/01/19 21:43:32 INFO session.SessionState: Created HDFS directory:
>> /tmp/hive/nobody/77350a02-d900-4c84-9456-134305044d21/_tmp_space.db
>> 16/01/19 21:43:32 INFO parquet.ParquetRelation: Listing
>> hdfs://namenode01:8020/user/julio/PFM/CDRs_parquet_np on driver
>> 16/01/19 21:43:33 INFO spark.SparkContext: Starting job: table at
>> code.scala:13
>> 16/01/19 21:43:33 INFO scheduler.DAGScheduler: Got job 0 (table at
>> code.scala:13) with 8 output partitions
>> 16/01/19 21:43:33 INFO scheduler.DAGScheduler: Final stage: ResultStage
>> 0 (table at code.scala:13)
>> 16/01/19 21:43:33 INFO scheduler.DAGScheduler: Parents of final stage: List()
>> 16/01/19 21:43:33 INFO scheduler.DAGScheduler: Missing parents: List()
>> 16/01/19 21:43:33 INFO scheduler.DAGScheduler: Submitting ResultStage 0
>> (MapPartitionsRDD[1] at table at code.scala:13), which has no missing parents
>> Exception in thread "dag-scheduler-event-loop"
>> Exception: java.lang.OutOfMemoryError thrown from the
>> UncaughtExceptionHandler in thread "dag-scheduler-event-loop"
>> Exception in thread "SparkListenerBus"
>> Exception: java.lang.OutOfMemoryError thrown from the
>> UncaughtExceptionHandler in thread "SparkListenerBus"
>>
>> It happens with whatever program I build, for example:
>>
>> object MainClass {
>>   def main(args: Array[String]): Unit = {
>>     val conf = new org.apache.spark.SparkConf().setAppName("test")
>>
>>     val sc = new org.apache.spark.SparkContext(conf)
>>     val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>
>>     val rdd = sqlContext.read.table("cdrs_np")
>>       .na.drop(how = "any")
>>       .map(_.toSeq.map(y => y.toString))
>>       .map(x => (x.head, x.tail))
>>
>>     rdd.saveAsTextFile(args(0))
>>   }
>> }
>>
>> The command I'm using in spark-submit is the following:
>>
>> spark-submit --master yarn \
>>   --deploy-mode cluster \
>>   --driver-memory 1G \
>>   --executor-memory 3000m \
>>   --executor-cores 1 \
>>   --num-executors 8 \
>>   --class MainClass \
>>   spark-yarn-cluster-test_2.10-0.1.jar \
>>   hdfs://namenode01/etl/test
>>
>> I've got more than enough resources in my cluster to run the job (in fact,
>> the exact same command works in --deploy-mode client).
>>
>> I tried increasing yarn.app.mapreduce.am.resource.mb to 2 GB, but that
>> didn't work. I guess there is another parameter I should tweak, but I have
>> not found any information whatsoever on the Internet.
>>
>> I'm running Spark 1.5.2 and YARN from Hadoop 2.6.0-cdh5.5.1.
>>
>> Any help would be greatly appreciated!
>>
>> Thank you.
>>
>> --
>> Julio Antonio Soto de Vicente
>
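[Editor's note: one detail worth adding here. In yarn-cluster mode the Spark driver runs inside the YARN ApplicationMaster container, so yarn.app.mapreduce.am.resource.mb — a MapReduce setting — has no effect on a Spark job. The knobs that size the AM container in cluster mode are --driver-memory plus the off-heap overhead YARN adds on top of it (spark.yarn.driver.memoryOverhead in Spark 1.5; defaults to 10% of driver memory, minimum 384 MB). A sketch with illustrative, untuned values:]

spark-submit --master yarn \
  --deploy-mode cluster \
  --driver-memory 4G \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  --executor-memory 3000m \
  --executor-cores 1 \
  --num-executors 8 \
  --class MainClass \
  spark-yarn-cluster-test_2.10-0.1.jar \
  hdfs://namenode01/etl/test

If the container is killed by YARN rather than dying with a Java OutOfMemoryError, the overhead setting is usually the one to raise; a genuine heap OOM in the driver points at --driver-memory.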