RE: [MLlib] Performance issues when building GBM models

2015-02-09 Thread Christopher Thom
4999 is 160 bytes
15/02/09 19:45:29 INFO storage.BlockManagerInfo: Added broadcast_18001_piece0 
in memory on hadoop-013:50803 (size: 3.8 KB, free: 10.3 GB)
15/02/09 19:45:29 INFO spark.MapOutputTrackerMasterActor: Asked to send map 
output locations for shuffle 4999 to sparkExecutor@hadoop-013:45260
15/02/09 19:45:29 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 
13000.0 (TID 25000) in 11 ms on hadoop-011 (1/2)
15/02/09 19:45:29 INFO storage.BlockManagerInfo: Added broadcast_17999_piece0 
in memory on hadoop-013:50803 (size: 81.0 B, free: 10.3 GB)
15/02/09 19:45:29 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
13000.0 (TID 24999) in 26 ms on hadoop-013 (2/2)
15/02/09 19:45:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 13000.0, 
whose tasks have all completed, from pool
15/02/09 19:45:29 INFO scheduler.DAGScheduler: Stage 13000 (collectAsMap at 
DecisionTree.scala:646) finished in 0.026 s
15/02/09 19:45:29 INFO scheduler.DAGScheduler: Job 8000 finished: collectAsMap 
at DecisionTree.scala:646, took 0.342683 s
15/02/09 19:45:29 INFO rdd.MapPartitionsRDD: Removing RDD 19988 from 
persistence list
15/02/09 19:45:29 INFO storage.BlockManager: Removing RDD 19988
15/02/09 19:45:29 INFO tree.RandomForest: Internal timing for DecisionTree:
15/02/09 19:45:29 INFO tree.RandomForest:   init: 9.903233409
  total: 15.855226062
  findSplitsBins: 4.557418734
  findBestSplits: 5.928304151
  chooseSplits: 5.927796717
15/02/09 19:45:29 INFO tree.GradientBoostedTrees: Internal timing for 
DecisionTree:
15/02/09 19:45:29 INFO tree.GradientBoostedTrees:   building tree 584: 
9.53796807
  building tree 303: 5.870926773
  building tree 293: 5.379115341
  building tree 599: 9.263506141
  building tree 479: 7.648729795
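
As a rough check on the t(n) = t(n-1) + const pattern, the five per-tree timings logged above can be fitted with a straight line. This is only a sketch: it assumes the logged values are seconds, and five samples from one run are not enough to draw a firm conclusion. The names `timings` and `fit_line` are illustrative and not part of Spark.

```python
# Per-tree build times copied from the GradientBoostedTrees timing log above.
# Assumption: the values are in seconds.
timings = {
    293: 5.379115341,
    303: 5.870926773,
    479: 7.648729795,
    584: 9.53796807,
    599: 9.263506141,
}


def fit_line(points):
    """Ordinary least-squares fit y = intercept + slope * x."""
    xs = list(points.keys())
    ys = list(points.values())
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(timings)
print(f"extra time per additional tree: {slope * 1000:.1f} ms")
print(f"extrapolated time for the first tree: {intercept:.2f} s")
```

On these five points the fitted slope is on the order of ten milliseconds per tree, which is consistent with a roughly constant increment per boosting iteration.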

-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Tuesday, 10 February 2015 7:07 AM
To: Christopher Thom
Cc: user@spark.apache.org
Subject: Re: [MLlib] Performance issues when building GBM models

Could you check the Spark UI and see whether there are RDDs being kicked out
during the computation? We cache the residual RDD after each iteration. If we
don't have enough memory/disk, it gets recomputed and results in something like
`t(n) = t(n-1) + const`. We might cache the features multiple times, which
could be improved.
-Xiangrui

On Sun, Feb 8, 2015 at 5:32 PM, Christopher Thom wrote:
> Hi All,
>
> I wonder if anyone else has some experience building a Gradient Boosted Trees 
> model using spark/mllib? I have noticed when building decent-size models that 
> the process slows down over time. We observe that the time to build tree n is 
> approximately a constant time longer than the time to build tree n-1 i.e. 
> t(n) = t(n-1) + const. The implication is that the total build time goes as 
> something like N^2, where N is the total number of trees. I would expect that 
> the algorithm should be approximately linear in total time (i.e. each 
> boosting iteration takes roughly the same time to complete).
>
> So I have a couple of questions:
> 1. Is this behaviour expected, or consistent with what others are seeing?
> 2. Does anyone know if there are tuning parameters (e.g. in the boosting
> strategy, or tree strategy) that may be impacting this?
>
> All aspects of the build seem to slow down as I go. Here's a random example 
> culled from the logs, from the beginning and end of the model build:
>
> 15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: count
> at DecisionTreeMetadata.scala:111, took 0.077957 s 
> 15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished:
> count at DecisionTreeMetadata.scala:111, took 5.495166 s
>
> Any thoughts or advice, or even suggestions on where to dig for more info 
> would be welcome.
>
> thanks
> chris
>
> Christopher Thom
>
> QUANTIUM
> Level 25, 8 Chifley, 8-12 Chifley Square Sydney NSW 2000
>
> T: +61 2 8222 3577
> F: +61 2 9292 6444
>
> W: quantium.com.au
>
> 
>
> linkedin.com/company/quantium
>
> facebook.com/QuantiumAustralia
>
> twitter.com/QuantiumAU
>
>
> The contents of this email, including attachments, may be confidential 
> information. If you are not the intended recipient, any use, disclosure or 
> copying of the information is unauthorised. If you have received this email 
> in error, we would be grateful if you would notify us immediately by email 
> reply, phone (+ 61 2 9292 6400) or fax (+ 61 2 9292 6444) and delete the 
> message from your system.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

Christopher Thom


Re: [MLlib] Performance issues when building GBM models

2015-02-09 Thread Xiangrui Meng
Could you check the Spark UI and see whether there are RDDs being
kicked out during the computation? We cache the residual RDD after
each iteration. If we don't have enough memory/disk, it gets
recomputed and results in something like `t(n) = t(n-1) + const`. We
might cache the features multiple times, which could be improved.
-Xiangrui
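
One way to read the explanation above is as a simple cost model: if the cached residuals survive, each iteration pays a fixed cost; if they are evicted, iteration n also has to replay the work of the n-1 earlier iterations. The sketch below is a toy model of that reading, not Spark code. The function names and the cost values (2.0 and 0.0125 time units) are hypothetical and chosen only for illustration.

```python
def iteration_cost(n, base_cost, replay_cost, cached):
    """Cost of boosting iteration n (1-based), in arbitrary time units.

    Toy assumption: when the residuals are not cached, iteration n must
    replay the contribution of each of the n-1 earlier trees.
    """
    if cached:
        return base_cost
    return base_cost + (n - 1) * replay_cost


def total_cost(num_trees, base_cost, replay_cost, cached):
    """Sum of the per-iteration costs over the whole model build."""
    return sum(
        iteration_cost(n, base_cost, replay_cost, cached)
        for n in range(1, num_trees + 1)
    )


# Hypothetical numbers, for illustration only.
NUM_TREES = 600
BASE_COST = 2.0
REPLAY_COST = 0.0125

cached_total = total_cost(NUM_TREES, BASE_COST, REPLAY_COST, cached=True)
evicted_total = total_cost(NUM_TREES, BASE_COST, REPLAY_COST, cached=False)

print(f"residuals cached:  {cached_total:.2f}")
print(f"residuals evicted: {evicted_total:.2f}")
```

In the evicted case each iteration costs exactly `REPLAY_COST` more than the previous one, which is the `t(n) = t(n-1) + const` pattern, and the total grows quadratically with the number of trees.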

On Sun, Feb 8, 2015 at 5:32 PM, Christopher Thom wrote:
> Hi All,
>
> I wonder if anyone else has some experience building a Gradient Boosted Trees 
> model using spark/mllib? I have noticed when building decent-size models that 
> the process slows down over time. We observe that the time to build tree n is 
> approximately a constant time longer than the time to build tree n-1 i.e. 
> t(n) = t(n-1) + const. The implication is that the total build time goes as 
> something like N^2, where N is the total number of trees. I would expect that 
> the algorithm should be approximately linear in total time (i.e. each 
> boosting iteration takes roughly the same time to complete).
>
> So I have a couple of questions:
> 1. Is this behaviour expected, or consistent with what others are seeing?
> 2. Does anyone know if there are tuning parameters (e.g. in the boosting
> strategy, or tree strategy) that may be impacting this?
>
> All aspects of the build seem to slow down as I go. Here's a random example 
> culled from the logs, from the beginning and end of the model build:
>
> 15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: count at 
> DecisionTreeMetadata.scala:111, took 0.077957 s
> 
> 15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished: count at 
> DecisionTreeMetadata.scala:111, took 5.495166 s
>
> Any thoughts or advice, or even suggestions on where to dig for more info 
> would be welcome.
>
> thanks
> chris
>
> Christopher Thom
>



[MLlib] Performance issues when building GBM models

2015-02-08 Thread Christopher Thom
Hi All,

I wonder if anyone else has some experience building a Gradient Boosted Trees 
model using spark/mllib? I have noticed when building decent-size models that 
the process slows down over time. We observe that the time to build tree n is 
approximately a constant time longer than the time to build tree n-1 i.e.
t(n) = t(n-1) + const. The implication is that the total build time goes as
something like N^2, where N is the total number of trees. I would expect that 
the algorithm should be approximately linear in total time (i.e. each boosting 
iteration takes roughly the same time to complete).
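
The step from the recurrence to the N^2 total can be written out as follows. The symbols c (the constant increment) and T(N) (the total build time for N trees) are introduced here only for the derivation.

```latex
t(n) = t(n-1) + c \;\Longrightarrow\; t(n) = t(1) + (n-1)\,c

T(N) = \sum_{n=1}^{N} t(n)
     = N\,t(1) + c \sum_{n=1}^{N} (n-1)
     = N\,t(1) + \frac{c\,N(N-1)}{2}
```

For large N the second term dominates, so the total grows as N^2, whereas a constant per-iteration time (c = 0) would give T(N) = N t(1).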

So I have a couple of questions:
1. Is this behaviour expected, or consistent with what others are seeing?
2. Does anyone know if there are tuning parameters (e.g. in the boosting
strategy, or tree strategy) that may be impacting this?

All aspects of the build seem to slow down as I go. Here's a random example 
culled from the logs, from the beginning and end of the model build:

15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: count at 
DecisionTreeMetadata.scala:111, took 0.077957 s

15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished: count at 
DecisionTreeMetadata.scala:111, took 5.495166 s
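
To track this slowdown across the whole run rather than at two hand-picked points, the "Job ... finished" lines can be pulled out of the driver log. The following is a minimal sketch that assumes the log format shown in the two lines above (with each entry on a single line); the regular expression and the helper name `parse_job_line` are illustrative and not part of Spark.

```python
import re
from datetime import datetime

# Matches DAGScheduler "Job N finished" lines in the format shown above.
JOB_FINISHED = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO scheduler\.DAGScheduler: "
    r"Job (?P<job>\d+) finished: (?P<call>.+?), took (?P<secs>[\d.]+) s$"
)


def parse_job_line(line):
    """Return a dict describing a finished job, or None if the line does not match."""
    match = JOB_FINISHED.match(line.strip())
    if match is None:
        return None
    return {
        # Timestamps in the log are written as yy/MM/dd HH:mm:ss.
        "time": datetime.strptime(match["ts"], "%y/%m/%d %H:%M:%S"),
        "job": int(match["job"]),
        "call": match["call"],
        "seconds": float(match["secs"]),
    }


# The two log entries quoted above, each joined onto one line.
lines = [
    "15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: "
    "count at DecisionTreeMetadata.scala:111, took 0.077957 s",
    "15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished: "
    "count at DecisionTreeMetadata.scala:111, took 5.495166 s",
]

jobs = [parse_job_line(line) for line in lines]
slowdown = jobs[1]["seconds"] / jobs[0]["seconds"]
elapsed = (jobs[1]["time"] - jobs[0]["time"]).total_seconds()

print(f"same call site, {slowdown:.1f}x slower after {elapsed:.0f} s of wall-clock time")
```

Grouping the parsed records by `call` and plotting `seconds` against `job` would show whether every call site degrades at the same rate.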

Any thoughts or advice, or even suggestions on where to dig for more info would 
be welcome.

thanks
chris

Christopher Thom
