Since it sounds like this has been encountered 3 times now, and I've
personally seen it and mostly verified it, I think it's legit enough
for a JIRA: SPARK-10433. I'm sorry to say I don't know what is going
on here though.
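
One thing that might be worth trying in the meantime -- just a guess on
my part, not a verified fix -- is more aggressive checkpointing, so the
lineage of the internal RDDs gets truncated periodically. A minimal
sketch (the checkpoint directory is a placeholder):

    // Untested workaround sketch: periodically checkpoint internal RDDs
    // so their lineage can't grow without bound. Needs a directory on a
    // shared filesystem (e.g. HDFS) in cluster mode.
    sc.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder path

    val boostingStrategy = BoostingStrategy.defaultParams("Classification")
    // Checkpoint every 5 iterations instead of the default 10.
    boostingStrategy.treeStrategy.setCheckpointInterval(5)

No idea yet whether that actually helps; at minimum it would tell us
whether lineage growth is the culprit.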

On Thu, Sep 3, 2015 at 1:56 PM, Peter Rudenko <petro.rude...@gmail.com> wrote:
> Confirmed, having the same issue (1.4.1, mllib package). For a smaller
> dataset, accuracy degraded as well. Haven't tested yet in 1.5 with the ml
> package implementation.
>
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   mapAsJavaMap(categoricalFeatures)
>     .asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])
>
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
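>
> (Side note: since treeStrategy.categoricalFeaturesInfo is a plain Scala
> var of type Map[Int, Int], the Java-map conversion and cast above
> shouldn't be needed -- assuming categoricalFeatures is already a
> Map[Int, Int], a direct assignment should be equivalent:
>
> // direct assignment to the Scala field, no mapAsJavaMap cast needed
> boostingStrategy.treeStrategy.categoricalFeaturesInfo = categoricalFeatures
>
> though I haven't checked whether it changes anything for this issue.)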
>
> Thanks,
> Peter Rudenko
>
> On 2015-08-14 00:33, Sean Owen wrote:
>
> Not that I have any answer at this point, but I was discussing this
> exact same problem with Johannes today. An input of ~20K records was
> growing by ~15M records per iteration. I could not see why on a first
> look.
>
> @jkbradley I know it's not much info but does that ring any bells? I
> think Johannes even has an instance of this up and running for
> examination.
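>
> If that instance is still running, one quick diagnostic that might
> narrow this down: dump the lineage of the cached input RDD before and
> after a few iterations and see whether the dependency graph keeps
> growing. Something like (the RDD name is a placeholder):
>
> // toDebugString prints the RDD's dependency graph; a DAG that grows
> // every iteration would point at missing persistence or checkpointing.
> println(input.toDebugString)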
>
> On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes
> <mfor...@twitter.com.invalid> wrote:
>
> I am training a boosted-trees model on a couple million input samples (with
> around 300 features) and am noticing that the input size of each stage
> increases with every iteration. For each new tree, the first step seems to
> be building the decision tree metadata, which does a .count() on the input
> data, so that is the step I've been using to track the changing input size.
> Here is what I'm seeing:
>
> count at DecisionTreeMetadata.scala:111
> 1. Input Size / Records: 726.1 MB / 1295620
> 2. Input Size / Records: 106.9 GB / 64780816
> 3. Input Size / Records: 160.3 GB / 97171224
> 4. Input Size / Records: 214.8 GB / 129680959
> 5. Input Size / Records: 268.5 GB / 162533424
> ....
> Input Size / Records: 1912.6 GB / 1382017686
> ....
>
> This step goes from taking less than 10s up to 5 minutes by the 15th or so
> iteration. I'm not quite sure what could be causing this. I am passing a
> memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train.
>
> Does anybody have some insight? Is this a bug, or could it be an error on
> my part?
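>
> For reference, a stripped-down version of what I'm running (feature
> parsing elided; the input path and names are placeholders):
>
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.tree.GradientBoostedTrees
> import org.apache.spark.mllib.tree.configuration.BoostingStrategy
> import org.apache.spark.storage.StorageLevel
>
> // ~2M LabeledPoints, ~300 features each, cached memory-only.
> val instances = sc.objectFile[LabeledPoint]("/path/to/instances")
>   .persist(StorageLevel.MEMORY_ONLY)
>
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
>
> // Each iteration hits the count at DecisionTreeMetadata.scala:111,
> // which is where the growing input size shows up.
> val model = GradientBoostedTrees.train(instances, boostingStrategy)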
