I'm experimenting with the Gradient Boosted Trees
<https://spark.apache.org/docs/1.2.0/mllib-ensembles.html>   learning
algorithm from the MLlib library of Spark 1.4. I'm solving a binary
classification problem where my input is ~50,000 samples and ~500,000
features. My goal is to output the definition of the resulting GBT ensemble
in a human-readable format. My experience so far is that, for my problem
size, adding more resources to the cluster seems to have no effect on the
length of the run. A 10-iteration training run takes roughly 13 hours. This
isn't acceptable, since I'm looking to do 100-300 iteration runs, and the
execution time seems to explode with the number of iterations.

*My Spark application*
This isn't the exact code, but it can be reduced to:

    SparkConf sc = new SparkConf().setAppName("GBT Trainer")
                // Unlimited max result size for intermediate Map-Reduce ops.
                // Having no limit is probably bad, but I've not had time to
                // find a tighter upper bound and the default value wasn't
                // sufficient.
                .set("spark.driver.maxResultSize", "0");
    JavaSparkContext jsc = new JavaSparkContext(sc);

    // The input file is encoded in plain-text LIBSVM format, ~59GB in size.
    JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(),
            "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD();

    BoostingStrategy boostingStrategy =
            BoostingStrategy.defaultParams("Classification");
    boostingStrategy.setNumIterations(10);
    boostingStrategy.getTreeStrategy().setNumClasses(2);
    boostingStrategy.getTreeStrategy().setMaxDepth(1);
    Map<Integer, Integer> categoricalFeaturesInfo =
            new HashMap<Integer, Integer>();
    boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);

    GradientBoostedTreesModel model =
            GradientBoostedTrees.train(data, boostingStrategy);

    // Somewhat convoluted code below reads the Parquet-formatted output
    // of the GBT model back in and writes it out as JSON.
    // There might be cleaner ways of achieving the same, but since the
    // output size is only a few KB I feel little guilt leaving it as is.

    // Serialize and output the GBT classifier model the only way the
    // library allows.
    String outputPath = "s3://somebucket/somekeyprefex";
    model.save(jsc.sc(), outputPath + "/parquet");
    // Read the Parquet-formatted classifier output back in as a generic
    // DataFrame object.
    SQLContext sqlContext = new SQLContext(jsc);
    DataFrame outputDataFrame = sqlContext.read().parquet(outputPath + "/parquet");
    // Output the DataFrame-formatted classifier model as JSON.
    outputDataFrame.write().format("json").save(outputPath + "/json");
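
One thing I haven't tried yet is explicitly repartitioning and caching the
training RDD before handing it to GradientBoostedTrees.train(). Since
boosting passes over the same input on every iteration, I suspect that going
back to the 59GB LIBSVM file on S3 each time could be part of the cost. A
rough, untested sketch (replacing the train() call above; the partition
count of 320 is just a guess at ~2x the total executor cores, and it assumes
org.apache.spark.storage.StorageLevel is imported):

    // Untested: spread the input over more partitions and keep the parsed
    // LabeledPoints in memory, spilling to disk if they don't fit, so each
    // boosting iteration doesn't have to re-read the file from S3.
    JavaRDD<LabeledPoint> cachedData = data
            .repartition(320)   // guess: ~2x total executor cores
            .persist(StorageLevel.MEMORY_AND_DISK());
    GradientBoostedTreesModel model =
            GradientBoostedTrees.train(cachedData, boostingStrategy);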

*Question*
What is the performance bottleneck in my Spark application (or in the GBT
learning algorithm itself) on input of this size, and how can I achieve
greater execution parallelism?

I'm still a novice Spark dev, and I'd appreciate any tips on cluster
configuration and execution profiling. 


*More details on the cluster setup*

I'm running this app on an AWS EMR cluster (emr-4.0.0, YARN cluster mode) of
r3.8xlarge instances (32 cores, 244GB RAM each). I'm using such large
instances in order to maximize flexibility of resource allocation. So far
I've tried using 1-3 r3.8xlarge instances with a variety of resource
allocation schemes between the driver and workers. For example, for a
cluster of a single r3.8xlarge instance I submit the app as follows:

    aws emr add-steps --cluster-id $1 --steps Name=$2,\
    Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,\
    Args=[/usr/lib/spark/bin/spark-submit,--verbose,\
    --deploy-mode,cluster,--master,yarn,\
    --driver-memory,60G,\
    --executor-memory,30G,\
    --executor-cores,5,\
    --num-executors,6,\
    --class,GbtTrainer,\
    "s3://somebucket/somekey/spark.jar"],\
    ActionOnFailure=CONTINUE

For a cluster of 3 r3.8xlarge instances I tweak resource allocation:

    --driver-memory,80G,\
    --executor-memory,35G,\
    --executor-cores,5,\
    --num-executors,18,\

I don't have a clear idea of how much memory is useful to give to every
executor, but I feel that I'm being generous in either case. Looking through
the Spark UI, I'm not seeing tasks with an input size of more than a few GB.
I'm erring on the side of caution by giving the driver process so much
memory, in order to ensure that it isn't memory-starved during any
intermediate result-aggregation operations.

I'm trying to keep the number of cores per executor down to 5, as per the
suggestions in Cloudera's How-to: Tune Your Apache Spark Jobs series
<http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/>
(according to them, more than 5 cores tends to introduce an HDFS I/O
bottleneck). I'm also making sure that there is enough spare RAM and CPU
left over for the host OS and the Hadoop services.
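
On a related note, I'm wondering whether I should also set
spark.default.parallelism explicitly so that the default partition count
roughly matches the total executor cores, rather than relying on whatever
the input split count happens to be. A hypothetical tweak to the SparkConf
from the code above (the value 90 = 18 executors * 5 cores is only an
illustration for the 3-instance cluster):

    // Hypothetical: align the default parallelism with the 18 executors *
    // 5 cores available on the 3-instance cluster.
    SparkConf sc = new SparkConf().setAppName("GBT Trainer")
            .set("spark.driver.maxResultSize", "0")
            .set("spark.default.parallelism", "90");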

*My findings thus far*
My only clue is the Spark UI showing a very long Scheduling Delay for a
number of tasks at the tail end of execution. I also get the feeling that the
stages/tasks timeline shown by the Spark UI does not account for all of the
time that the job takes to finish. I suspect that the driver application is
stuck performing some kind of lengthy operation, either at the end of every
training iteration or at the end of the entire training run.
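
If part of the slowdown comes from the RDD lineage growing with every
boosting iteration, maybe periodic checkpointing of the intermediate
tree-training state would help. A sketch of what I'd try, assuming the
setUseNodeIdCache/setCheckpointInterval bean setters exist on the tree
Strategy in this Spark version (I haven't verified this on 1.4):

    // Untested: enable the node-ID cache and checkpoint it every few
    // iterations to keep the lineage (and the scheduled task graph) short.
    jsc.setCheckpointDir("s3://somebucket/somekey/checkpoints");
    boostingStrategy.getTreeStrategy().setUseNodeIdCache(true);
    boostingStrategy.getTreeStrategy().setCheckpointInterval(5);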

I've already done a fair bit of research on tuning Spark applications. Most
articles give great suggestions on using RDD operations that reduce
intermediate input size or avoid shuffling data between stages. In my case
I'm basically using an "out-of-the-box" algorithm, written by ML experts,
which *should* already be well tuned in this regard. My own code that outputs
the GBT model to S3 should take a trivial amount of time to run.

Appreciate your time!



