Since it sounds like this has been encountered three times, and I've personally seen it and mostly verified it, I think it's legit enough for a JIRA: SPARK-10433. I am sorry to say I don't know what is going on here, though.
On Thu, Sep 3, 2015 at 1:56 PM, Peter Rudenko <petro.rude...@gmail.com> wrote:
> Confirmed, I'm having the same issue (1.4.1, mllib package). For a smaller
> dataset, accuracy also degraded. I haven't tested yet in 1.5 with the ml
> package implementation.
>
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   mapAsJavaMap(categoricalFeatures)
>     .asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])
>
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
>
> Thanks,
> Peter Rudenko
>
> On 2015-08-14 00:33, Sean Owen wrote:
>
> Not that I have any answer at this point, but I was discussing this
> exact same problem with Johannes today. An input size of ~20K records
> was growing each iteration by ~15M records. I could not see why on a
> first look.
>
> @jkbradley I know it's not much info but does that ring any bells? I
> think Johannes even has an instance of this up and running for
> examination.
>
> On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes
> <mfor...@twitter.com.invalid> wrote:
>
> I am training a boosted trees model on a couple million input samples (with
> around 300 features) and am noticing that the input size of each stage is
> increasing each iteration. For each new tree, the first step seems to be
> building the decision tree metadata, which does a .count() on the input
> data, so this is the step I've been using to track the input size changing.
> Here is what I'm seeing:
>
> count at DecisionTreeMetadata.scala:111
> 1. Input Size / Records: 726.1 MB / 1295620
> 2. Input Size / Records: 106.9 GB / 64780816
> 3. Input Size / Records: 160.3 GB / 97171224
> 4. Input Size / Records: 214.8 GB / 129680959
> 5. Input Size / Records: 268.5 GB / 162533424
> ....
> Input Size / Records: 1912.6 GB / 1382017686
> ....
>
> This step goes from taking less than 10s up to 5 minutes by the 15th or so
> iteration. I'm not quite sure what could be causing this. I am passing a
> memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train
>
> Does anybody have some insight? Is this a bug or could it be an error on my
> part?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
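
As a rough sanity check on the counts Matt reported, the Scala sketch below computes how many records each boosting iteration adds relative to the original input. It is plain arithmetic on the five counts quoted in the thread and says nothing about the cause. The names `InputGrowth` and `deltas` are made up for illustration and are not part of Spark.

```scala
object InputGrowth {
  // Record counts copied from the "count at DecisionTreeMetadata.scala:111"
  // stages quoted in the thread (iterations 1 through 5).
  val records: Vector[Long] =
    Vector(1295620L, 64780816L, 97171224L, 129680959L, 162533424L)

  // Increase in record count from each iteration to the next.
  def deltas(xs: Seq[Long]): Seq[Long] =
    xs.zip(xs.tail).map { case (prev, next) => next - prev }

  def main(args: Array[String]): Unit = {
    val base = records.head.toDouble
    deltas(records).zipWithIndex.foreach { case (d, i) =>
      // Express each increase as a multiple of the original input size.
      println(f"iteration ${i + 1} -> ${i + 2}: +$d records (${d / base}%.1fx the original input)")
    }
  }
}
```

On these numbers the first step adds about 49 times the original 1,295,620 records, and each later step adds about 25 times that amount, so the reported growth looks roughly linear per iteration rather than exponential.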