Re: Spark ML's RandomForestClassifier OOM

2017-01-10 Thread Julio Antonio Soto de Vicente
No. I am running Spark on YARN on a 3-node testing cluster. 

My guess is that, given the number of splits evaluated by a hundred trees of depth 30 
(which in the worst case means more than 100 * 2^30 nodes), either the executors or the 
driver die of OOM while trying to store all the split metadata. I suspect the same issue 
affects both local and distributed modes, but those are just conjectures.
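
If that is the cause, the knobs that bound this metadata live on the estimator itself. A minimal sketch, not code from this thread; the values are illustrative and the column names are assumptions:

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")        // assumed column name
  .setFeaturesCol("features")  // assumed column name
  .setNumTrees(100)
  .setMaxBins(32)
  .setMaxDepth(15)             // ~2^15 leaves per tree instead of ~2^30
  .setMaxMemoryInMB(1024)      // default 256; per-iteration budget for split-statistics aggregation

Since node count grows exponentially with depth, lowering maxDepth is by far the biggest lever here.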

--
Julio

> On 10 Jan 2017, at 11:22, Marco Mistroni <mmistr...@gmail.com> wrote:
> 
> Are you running locally? I found exactly the same issue.
> Two solutions:
> - reduce data size
> - run on EMR
> Hth
> 
>> On 10 Jan 2017 10:07 am, "Julio Antonio Soto" <ju...@esbet.es> wrote:
>> Hi, 
>> 
>> I am running into OOM problems while training a Spark ML 
>> RandomForestClassifier (maxDepth of 30, 32 maxBins, 100 trees).
>> 
>> My dataset is arguably pretty big given the executor count and size (8 executors x 5 GB), 
>> with approximately 20M rows and 130 features.
>> 
>> The "fun fact" is that a single DecisionTreeClassifier with the same specs 
>> (same maxDepth and maxBins) is able to train without problems in a couple of 
>> minutes.
>> 
>> AFAIK the current random forest implementation grows each tree sequentially, 
>> which means that DecisionTreeClassifiers are fit one by one, and therefore 
>> the training process should be similar in terms of memory consumption. Am I 
>> missing something here?
>> 
>> Thanks
>> Julio


Spark ML's RandomForestClassifier OOM

2017-01-10 Thread Julio Antonio Soto
Hi,

I am running into OOM problems while training a Spark ML
RandomForestClassifier (maxDepth of 30, 32 maxBins, 100 trees).

My dataset is arguably pretty big given the executor count and size (8 executors x 5 GB),
with approximately 20M rows and 130 features.

The "fun fact" is that a single DecisionTreeClassifier with the same specs
(same maxDepth and maxBins) is able to train without problems in a couple
of minutes.

AFAIK the current random forest implementation grows each tree
sequentially, which means that DecisionTreeClassifiers are fit one by one,
and therefore the training process should be similar in terms of memory
consumption. Am I missing something here?
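
For concreteness, a minimal sketch of the two setups being compared; this is not the actual job code, and the label/features columns are left at their defaults:

import org.apache.spark.ml.classification.{DecisionTreeClassifier, RandomForestClassifier}

// Trains without problems in a couple of minutes:
val dt = new DecisionTreeClassifier()
  .setMaxDepth(30)
  .setMaxBins(32)

// Same per-tree specs plus 100 trees; this is the setup that OOMs:
val rf = new RandomForestClassifier()
  .setMaxDepth(30)
  .setMaxBins(32)
  .setNumTrees(100)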

Thanks
Julio


OOM on yarn-cluster mode

2016-01-19 Thread Julio Antonio Soto
Hi,

I'm having trouble when submitting Spark jobs in yarn-cluster mode. While the
job works and completes in yarn-client mode, I hit the following error when
using spark-submit in yarn-cluster (simplified):

16/01/19 21:43:31 INFO hive.metastore: Connected to metastore.
16/01/19 21:43:32 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
16/01/19 21:43:32 INFO session.SessionState: Created local directory:
/yarn/nm/usercache/julio/appcache/application_1453120455858_0040/container_1453120455858_0040_01_01/tmp/77350a02-d900-4c84-9456-134305044d21_resources
16/01/19 21:43:32 INFO session.SessionState: Created HDFS directory:
/tmp/hive/nobody/77350a02-d900-4c84-9456-134305044d21
16/01/19 21:43:32 INFO session.SessionState: Created local directory:
/yarn/nm/usercache/julio/appcache/application_1453120455858_0040/container_1453120455858_0040_01_01/tmp/nobody/77350a02-d900-4c84-9456-134305044d21
16/01/19 21:43:32 INFO session.SessionState: Created HDFS directory:
/tmp/hive/nobody/77350a02-d900-4c84-9456-134305044d21/_tmp_space.db
16/01/19 21:43:32 INFO parquet.ParquetRelation: Listing
hdfs://namenode01:8020/user/julio/PFM/CDRs_parquet_np on driver
16/01/19 21:43:33 INFO spark.SparkContext: Starting job: table at code.scala:13
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Got job 0 (table at
code.scala:13) with 8 output partitions
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Final stage:
ResultStage 0(table at code.scala:13)
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Missing parents: List()
16/01/19 21:43:33 INFO scheduler.DAGScheduler: Submitting ResultStage
0 (MapPartitionsRDD[1] at table at code.scala:13), which has no
missing parents
Exception in thread "dag-scheduler-event-loop"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "dag-scheduler-event-loop"
Exception in thread "SparkListenerBus"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "SparkListenerBus"

It happens with whatever program I build, for example:

object MainClass {
  def main(args: Array[String]): Unit = {
    val conf = new org.apache.spark.SparkConf().setAppName("test")

    val sc = new org.apache.spark.SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // Read the Hive table, drop rows containing any null, and map each
    // row to (first column, remaining columns), all as strings.
    val rdd = sqlContext.read.table("cdrs_np")
      .na.drop(how = "any")
      .map(_.toSeq.map(y => y.toString))
      .map(x => (x.head, x.tail))

    rdd.saveAsTextFile(args(0))
  }
}

The command I'm using in spark-submit is the following:

spark-submit --master yarn \
 --deploy-mode cluster \
 --driver-memory 1G \
 --executor-memory 3000m \
 --executor-cores 1 \
 --num-executors 8 \
 --class MainClass \
 spark-yarn-cluster-test_2.10-0.1.jar \
 hdfs://namenode01/etl/test

I've got more than enough resources in my cluster to run the job
(in fact, the exact same command works in --deploy-mode client).

I tried increasing yarn.app.mapreduce.am.resource.mb to 2GB, but that
didn't work. I guess there is another parameter I should tweak, but I have
not found any info whatsoever on the Internet.
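
One note: yarn.app.mapreduce.am.resource.mb only sizes the ApplicationMaster of
MapReduce jobs, so Spark ignores it. In yarn-cluster mode the driver runs inside
Spark's own YARN ApplicationMaster, so the relevant knobs are --driver-memory and
spark.yarn.driver.memoryOverhead. A sketch of the usual first adjustment; the 4G
and 1024 figures are illustrative, not tested values:

spark-submit --master yarn \
 --deploy-mode cluster \
 --driver-memory 4G \
 --conf spark.yarn.driver.memoryOverhead=1024 \
 --executor-memory 3000m \
 --executor-cores 1 \
 --num-executors 8 \
 --class MainClass \
 spark-yarn-cluster-test_2.10-0.1.jar \
 hdfs://namenode01/etl/test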

I'm running Spark 1.5.2 and YARN from Hadoop 2.6.0-cdh5.5.1.


Any help would be greatly appreciated!

Thank you.

-- 
Julio Antonio Soto de Vicente


Re: OOM on yarn-cluster mode

2016-01-19 Thread Julio Antonio Soto de Vicente
Hi,

I tried with --driver-memory 16G (more than enough to read a simple Parquet 
table), but the problem persists.

Everything works fine in yarn-client.
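
If a 16G heap does not help while yarn-client works, the exhausted region may not be
the Java heap at all. One possibility, an assumption rather than anything confirmed in
this thread: on Java 7, HiveContext loads a lot of classes in the driver, and PermGen
is sized independently of --driver-memory. A quick way to rule it out (the 512m value
is illustrative):

spark-submit --master yarn --deploy-mode cluster \
 --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=512m \
 ... # rest of the command as before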

--
Julio Antonio Soto de Vicente

> On 19 Jan 2016, at 22:18, Saisai Shao <sai.sai.s...@gmail.com> wrote:
> 
> You could try increasing the driver memory with "--driver-memory"; it looks like the 
> OOM came from the driver side, so the simple solution is to increase the 
> memory of the driver.
> 
>> On Tue, Jan 19, 2016 at 1:15 PM, Julio Antonio Soto <ju...@esbet.es> wrote:
>> Hi,
>> 
>> I'm having trouble when submitting Spark jobs in yarn-cluster mode. While the 
>> job works and completes in yarn-client mode, I hit the following error when 
>> using spark-submit in yarn-cluster (simplified):
>> [...]
>