Sean Owen created SPARK-10433:
---------------------------------

             Summary: Gradient boosted trees
                 Key: SPARK-10433
                 URL: https://issues.apache.org/jira/browse/SPARK-10433
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.4.1, 1.5.0
            Reporter: Sean Owen


(Sorry to say I don't have any leads on a fix, but this was reported by three 
different people and I confirmed it at fairly close range, so I think it's 
legitimate.)

This is probably best explained in the words of the mailing list thread at 
http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E

Matt Forbes says:

{quote}
I am training a boosted trees model on a couple million input samples (with 
around 300 features) and am noticing that the input size of each stage is 
increasing each iteration. For each new tree, the first step seems to be 
building the decision tree metadata, which does a .count() on the input data, 
so this is the step I've been using to track the input size changing. Here is 
what I'm seeing: 
{quote}

{code}
count at DecisionTreeMetadata.scala:111 
1. Input Size / Records: 726.1 MB / 1295620 
2. Input Size / Records: 106.9 GB / 64780816 
3. Input Size / Records: 160.3 GB / 97171224 
4. Input Size / Records: 214.8 GB / 129680959 
5. Input Size / Records: 268.5 GB / 162533424 
.... 
Input Size / Records: 1912.6 GB / 1382017686 
.... 
{code}

{quote}
This step goes from taking less than 10s up to 5 minutes by the 15th or so 
iteration. I'm not quite sure what could be causing this. I am passing a 
memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train 
{quote}
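
To put the growth in perspective, the record counts in the first five numbered lines of the listing above can be compared against the first one. The following is only a back-of-the-envelope sketch in plain Scala (no Spark needed); the object name InputGrowth is made up for illustration and is not part of the original report.

```scala
// Record counts copied from the first five
// "count at DecisionTreeMetadata.scala:111" lines quoted above.
// InputGrowth is a hypothetical name used only for this illustration.
object InputGrowth {
  val records: Seq[Long] =
    Seq(1295620L, 64780816L, 97171224L, 129680959L, 162533424L)

  // Each count expressed as a multiple of the first iteration's count.
  val ratios: Seq[Double] = records.map(_.toDouble / records.head)

  // The same multiples rounded to the nearest whole number.
  val roundedRatios: Seq[Long] = ratios.map(r => math.round(r))

  // Additional records counted at each iteration compared with the previous one.
  val deltas: Seq[Long] =
    records.zip(records.tail).map { case (prev, next) => next - prev }

  def main(args: Array[String]): Unit = {
    records.indices.foreach { i =>
      println(f"iteration ${i + 1}: ${records(i)}%d records, ${ratios(i)}%.2f x the first count")
    }
    println(s"per-iteration increase: ${deltas.mkString(", ")}")
  }
}
```

The counts come out at roughly 1, 50, 75, 100 and 125 times the first one, so after the initial jump the number of records counted grows by about 25 times the original input per iteration instead of staying constant.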

Johannes Bauer showed me a very similar problem.

Peter Rudenko offers this sketch of a reproduction:

{code}
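import scala.collection.JavaConversions.mapAsJavaMap
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// `instances` (an RDD[LabeledPoint]) and `categoricalFeatures` (a Map[Int, Int])
// are assumed to be defined by the caller; the original sketch does not show them.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.setNumIterations(30)
boostingStrategy.setLearningRate(1.0)
boostingStrategy.treeStrategy.setMaxDepth(3)
boostingStrategy.treeStrategy.setMaxBins(128)
boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
boostingStrategy.treeStrategy.setUseNodeIdCache(true)
boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
  mapAsJavaMap(categoricalFeatures)
    .asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])

val model = GradientBoostedTrees.train(instances, boostingStrategy)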
{code}



