Is this an artifact of a recent change? Does this not show up in any of the
tests or benchmarks?

On Thu, Aug 13, 2015 at 2:33 PM, Sean Owen <so...@cloudera.com> wrote:

> Not that I have any answer at this point, but I was discussing this
> exact same problem with Johannes today. An input size of ~20K records
> was growing each iteration by ~15M records. I could not see why on a
> first look.
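>
> One quick way to check (a sketch on my part; `input` below just stands
> for whatever RDD each iteration actually consumes) is to print the
> RDD's lineage per iteration and see whether it keeps growing:
>
>     // Scala: if the printed lineage gets longer on every iteration, each
>     // tree is recomputing all earlier stages instead of reading the
>     // cached data, which would inflate the reported input size.
>     println(input.toDebugString)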
>
> @jkbradley I know it's not much info but does that ring any bells? I
> think Johannes even has an instance of this up and running for
> examination.
>
> On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes
> <mfor...@twitter.com.invalid> wrote:
> > I am training a boosted trees model on a couple million input samples
> > (with around 300 features) and am noticing that the input size of each
> > stage is increasing each iteration. For each new tree, the first step
> > seems to be building the decision tree metadata, which does a .count()
> > on the input data, so this is the step I've been using to track the
> > input size changing. Here is what I'm seeing:
> >
> > count at DecisionTreeMetadata.scala:111
> > 1. Input Size / Records: 726.1 MB / 1295620
> > 2. Input Size / Records: 106.9 GB / 64780816
> > 3. Input Size / Records: 160.3 GB / 97171224
> > 4. Input Size / Records: 214.8 GB / 129680959
> > 5. Input Size / Records: 268.5 GB / 162533424
> > ....
> > Input Size / Records: 1912.6 GB / 1382017686
> > ....
> >
> > This step goes from taking less than 10s up to 5 minutes by the 15th
> > or so iteration. I'm not quite sure what could be causing this. I am
> > passing a memory-only cached RDD[LabeledPoint] to
> > GradientBoostedTrees.train.
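> >
> > Roughly, my setup looks like this (a simplified sketch; loadSamples()
> > and the parameter values are placeholders, not my exact code):
> >
> >     import org.apache.spark.mllib.regression.LabeledPoint
> >     import org.apache.spark.mllib.tree.GradientBoostedTrees
> >     import org.apache.spark.mllib.tree.configuration.BoostingStrategy
> >     import org.apache.spark.rdd.RDD
> >     import org.apache.spark.storage.StorageLevel
> >
> >     // ~1.3M rows, ~300 features; cached in memory only
> >     val data: RDD[LabeledPoint] = loadSamples()
> >     data.persist(StorageLevel.MEMORY_ONLY)
> >
> >     val boostingStrategy = BoostingStrategy.defaultParams("Regression")
> >     boostingStrategy.numIterations = 30
> >     boostingStrategy.treeStrategy.maxDepth = 5
> >
> >     val model = GradientBoostedTrees.train(data, boostingStrategy)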
> >
> > Does anybody have some insight? Is this a bug or could it be an error
> > on my part?
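> >
> > If it does turn out to be lineage growing across iterations, would
> > checkpointing be the right mitigation? I'm assuming (not sure) that the
> > checkpointInterval on the tree strategy is what controls this, e.g.:
> >
> >     // Assumption on my part: checkpointing every few iterations should
> >     // truncate the lineage. The directory is just an example path.
> >     sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")
> >     boostingStrategy.treeStrategy.checkpointInterval = 3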
>
