Confirming, I'm having the same issue (1.4.1 mllib package). For a smaller dataset the accuracy also degraded. I haven't tested yet in 1.5 with the ml package implementation. My configuration:

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.setNumIterations(30)
boostingStrategy.setLearningRate(1.0)
boostingStrategy.treeStrategy.setMaxDepth(3)
boostingStrategy.treeStrategy.setMaxBins(128)
boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
boostingStrategy.treeStrategy.setUseNodeIdCache(true)
boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
  mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])
val model = GradientBoostedTrees.train(instances, boostingStrategy)
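
For reference, if someone wants to try the 1.5 ml package, I'd expect the equivalent setup to look roughly like this (untested on my side; the "label"/"features" column names and trainingDF are just placeholders):

import org.apache.spark.ml.classification.GBTClassifier

// Roughly the same settings as the mllib config above. In the ml API, categorical
// features come from column metadata (e.g. via VectorIndexer) rather than a
// categoricalFeaturesInfo map.
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(30)
  .setStepSize(1.0)
  .setMaxDepth(3)
  .setMaxBins(128)
  .setSubsamplingRate(1.0)
  .setMinInstancesPerNode(1)
  .setCacheNodeIds(true)

// trainingDF: a DataFrame with "label" and "features" columns (placeholder name).
val model = gbt.fit(trainingDF)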

Thanks,
Peter Rudenko

On 2015-08-14 00:33, Sean Owen wrote:

Not that I have any answer at this point, but I was discussing this
exact same problem with Johannes today. An input size of ~20K records
was growing each iteration by ~15M records. I could not see why on a
first look.

@jkbradley I know it's not much info but does that ring any bells? I
think Johannes even has an instance of this up and running for
examination.

On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes
<mfor...@twitter.com.invalid> wrote:
I am training a boosted trees model on a couple of million input samples (with
around 300 features) and am noticing that the input size of each stage
increases with each iteration. For each new tree, the first step seems to be
building the decision tree metadata, which does a .count() on the input
data, so this is the step I've been using to track how the input size changes.
Here is what I'm seeing:

count at DecisionTreeMetadata.scala:111
1. Input Size / Records: 726.1 MB / 1295620
2. Input Size / Records: 106.9 GB / 64780816
3. Input Size / Records: 160.3 GB / 97171224
4. Input Size / Records: 214.8 GB / 129680959
5. Input Size / Records: 268.5 GB / 162533424
....
Input Size / Records: 1912.6 GB / 1382017686
....

This step goes from taking less than 10s up to 5 minutes by the 15th or so
iteration. I'm not quite sure what could be causing this. I am passing a
memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train.
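Roughly like the sketch below; the data-loading step and variable names are just placeholders for what I'm actually doing:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// loadLabeledPoints and sc stand in for however the training data is actually built.
val input: RDD[LabeledPoint] = loadLabeledPoints(sc)
input.persist(StorageLevel.MEMORY_ONLY) // memory-only cache, as described above

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
val model = GradientBoostedTrees.train(input, boostingStrategy)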

Does anybody have some insight? Is this a bug or could it be an error on my
part?