RE: [MLlib] Performance issues when building GBM models
4999 is 160 bytes
15/02/09 19:45:29 INFO storage.BlockManagerInfo: Added broadcast_18001_piece0 in memory on hadoop-013:50803 (size: 3.8 KB, free: 10.3 GB)
15/02/09 19:45:29 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 4999 to sparkExecutor@hadoop-013:45260
15/02/09 19:45:29 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 13000.0 (TID 25000) in 11 ms on hadoop-011 (1/2)
15/02/09 19:45:29 INFO storage.BlockManagerInfo: Added broadcast_17999_piece0 in memory on hadoop-013:50803 (size: 81.0 B, free: 10.3 GB)
15/02/09 19:45:29 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 13000.0 (TID 24999) in 26 ms on hadoop-013 (2/2)
15/02/09 19:45:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 13000.0, whose tasks have all completed, from pool
15/02/09 19:45:29 INFO scheduler.DAGScheduler: Stage 13000 (collectAsMap at DecisionTree.scala:646) finished in 0.026 s
15/02/09 19:45:29 INFO scheduler.DAGScheduler: Job 8000 finished: collectAsMap at DecisionTree.scala:646, took 0.342683 s
15/02/09 19:45:29 INFO rdd.MapPartitionsRDD: Removing RDD 19988 from persistence list
15/02/09 19:45:29 INFO storage.BlockManager: Removing RDD 19988
15/02/09 19:45:29 INFO tree.RandomForest: Internal timing for DecisionTree:
15/02/09 19:45:29 INFO tree.RandomForest:
  init: 9.903233409
  total: 15.855226062
  findSplitsBins: 4.557418734
  findBestSplits: 5.928304151
  chooseSplits: 5.927796717
15/02/09 19:45:29 INFO tree.GradientBoostedTrees: Internal timing for DecisionTree:
15/02/09 19:45:29 INFO tree.GradientBoostedTrees:
  building tree 584: 9.53796807
  building tree 303: 5.870926773
  building tree 293: 5.379115341
  building tree 599: 9.263506141
  building tree 479: 7.648729795

-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Tuesday, 10 February 2015 7:07 AM
To: Christopher Thom
Cc: user@spark.apache.org
Subject: Re: [MLlib] Performance issues when building GBM models

Could you check the Spark UI and see whether there are
RDDs being kicked out during the computation? We cache the residual RDD after each iteration. If we don't have enough memory/disk, it gets recomputed and results in something like `t(n) = t(n-1) + const`. We might cache the features multiple times, which could be improved.

-Xiangrui

On Sun, Feb 8, 2015 at 5:32 PM, Christopher Thom wrote:
> Hi All,
>
> I wonder if anyone else has some experience building a Gradient Boosted Trees
> model using spark/mllib? I have noticed when building decent-size models that
> the process slows down over time. We observe that the time to build tree n is
> approximately a constant time longer than the time to build tree n-1, i.e.
> t(n) = t(n-1) + const. The implication is that the total build time goes as
> something like N^2, where N is the total number of trees. I would expect that
> the algorithm should be approximately linear in total time (i.e. each
> boosting iteration takes roughly the same time to complete).
>
> So I have a couple of questions:
> 1. Is this behaviour expected, or consistent with what others are seeing?
> 2. Does anyone know if there are tuning parameters (e.g. in the boosting
> strategy, or tree strategy) that may be impacting this?
>
> All aspects of the build seem to slow down as I go. Here's a random example
> culled from the logs, from the beginning and end of the model build:
>
> 15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: count
> at DecisionTreeMetadata.scala:111, took 0.077957 s
> 15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished:
> count at DecisionTreeMetadata.scala:111, took 5.495166 s
>
> Any thoughts or advice, or even suggestions on where to dig for more info
> would be welcome.
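The per-tree timings in the log above can be checked against the `t(n) = t(n-1) + const` pattern with a quick least-squares fit. The following is a minimal sketch in plain Python, assuming only the five (tree index, seconds) pairs printed by the GradientBoostedTrees timer; the function name `fit_line` is made up for illustration, and five points are too few to be conclusive.

```python
# (tree index, build time in seconds) copied from the GradientBoostedTrees
# "building tree N" timer lines in the log.
timings = [
    (293, 5.379115341),
    (303, 5.870926773),
    (479, 7.648729795),
    (584, 9.53796807),
    (599, 9.263506141),
]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(timings)
print(f"extra time per additional tree: {slope:.4f} s")
print(f"extrapolated time for tree 0:   {intercept:.2f} s")
```

A clearly positive slope is what the linear-growth pattern predicts; a slope near zero would mean each tree costs about the same to build.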
Re: [MLlib] Performance issues when building GBM models
Could you check the Spark UI and see whether there are RDDs being kicked out during the computation? We cache the residual RDD after each iteration. If we don't have enough memory/disk, it gets recomputed and results in something like `t(n) = t(n-1) + const`. We might cache the features multiple times, which could be improved.

-Xiangrui

On Sun, Feb 8, 2015 at 5:32 PM, Christopher Thom wrote:
> Hi All,
>
> I wonder if anyone else has some experience building a Gradient Boosted Trees
> model using spark/mllib? I have noticed when building decent-size models that
> the process slows down over time. We observe that the time to build tree n is
> approximately a constant time longer than the time to build tree n-1, i.e.
> t(n) = t(n-1) + const. The implication is that the total build time goes as
> something like N^2, where N is the total number of trees. I would expect that
> the algorithm should be approximately linear in total time (i.e. each
> boosting iteration takes roughly the same time to complete).
>
> So I have a couple of questions:
> 1. Is this behaviour expected, or consistent with what others are seeing?
> 2. Does anyone know if there are tuning parameters (e.g. in the boosting
> strategy, or tree strategy) that may be impacting this?
>
> All aspects of the build seem to slow down as I go. Here's a random example
> culled from the logs, from the beginning and end of the model build:
>
> 15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: count at
> DecisionTreeMetadata.scala:111, took 0.077957 s
>
> 15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished: count at
> DecisionTreeMetadata.scala:111, took 5.495166 s
>
> Any thoughts or advice, or even suggestions on where to dig for more info
> would be welcome.
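To see why an evicted cache would produce `t(n) = t(n-1) + const`, here is a toy model in plain Python (not Spark code). It assumes each boosting iteration derives its residuals from the previous iteration's residuals at one unit of work per step, and that a missing cache forces the whole chain to be replayed; `work_for_iteration`, `total_work`, and the unit cost are illustrative assumptions, not MLlib internals.

```python
def work_for_iteration(n, cached):
    """Units of work to produce the residuals needed for iteration n.

    cached=True: the previous residuals are still available, so only one
    new step is computed. cached=False: the previous residuals were evicted,
    so the chain of n steps is replayed from the input data.
    """
    return 1 if cached else n


def total_work(num_trees, cached):
    """Sum the per-iteration work over a run of num_trees iterations."""
    return sum(work_for_iteration(n, cached) for n in range(1, num_trees + 1))


num_trees = 600
print("with cache:   ", total_work(num_trees, cached=True))   # grows like N
print("without cache:", total_work(num_trees, cached=False))  # grows like N^2 / 2
```

In this model the uncached run costs N(N+1)/2 units in total, which is the quadratic growth described in the original question.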
[MLlib] Performance issues when building GBM models
Hi All,

I wonder if anyone else has some experience building a Gradient Boosted Trees model using spark/mllib? I have noticed when building decent-size models that the process slows down over time. We observe that the time to build tree n is approximately a constant time longer than the time to build tree n-1, i.e. t(n) = t(n-1) + const. The implication is that the total build time goes as something like N^2, where N is the total number of trees. I would expect that the algorithm should be approximately linear in total time (i.e. each boosting iteration takes roughly the same time to complete).

So I have a couple of questions:
1. Is this behaviour expected, or consistent with what others are seeing?
2. Does anyone know if there are tuning parameters (e.g. in the boosting strategy, or tree strategy) that may be impacting this?

All aspects of the build seem to slow down as I go. Here's a random example culled from the logs, from the beginning and end of the model build:

15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: count at DecisionTreeMetadata.scala:111, took 0.077957 s
15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished: count at DecisionTreeMetadata.scala:111, took 5.495166 s

Any thoughts or advice, or even suggestions on where to dig for more info would be welcome.

thanks
chris

Christopher Thom
QUANTIUM
Level 25, 8 Chifley, 8-12 Chifley Square
Sydney NSW 2000
T: +61 2 8222 3577
F: +61 2 9292 6444
W: quantium.com.au
linkedin.com/company/quantium
facebook.com/QuantiumAustralia
twitter.com/QuantiumAU

The contents of this email, including attachments, may be confidential information. If you are not the intended recipient, any use, disclosure or copying of the information is unauthorised. If you have received this email in error, we would be grateful if you would notify us immediately by email reply, phone (+ 61 2 9292 6400) or fax (+ 61 2 9292 6444) and delete the message from your system.
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
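As one way to dig further, the DAGScheduler "Job ... finished" lines can be pulled out of the driver log and compared over the run. This is a minimal sketch in plain Python, assuming the log format matches the two lines quoted in the original message; the regular expression and the name `parse_job_times` are illustrative, not part of Spark.

```python
import re

# Matches lines such as:
# 15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: count at DecisionTreeMetadata.scala:111, took 0.077957 s
JOB_FINISHED = re.compile(
    r"DAGScheduler: Job (?P<job>\d+) finished: (?P<call>.+?), took (?P<secs>[\d.]+) s"
)


def parse_job_times(lines):
    """Yield (job id, call site, seconds) for each 'Job N finished' line."""
    for line in lines:
        match = JOB_FINISHED.search(line)
        if match:
            yield int(match["job"]), match["call"], float(match["secs"])


# The two log lines quoted in the original message.
sample = [
    "15/02/09 17:22:11 INFO scheduler.DAGScheduler: Job 42 finished: "
    "count at DecisionTreeMetadata.scala:111, took 0.077957 s",
    "15/02/09 19:44:01 INFO scheduler.DAGScheduler: Job 7954 finished: "
    "count at DecisionTreeMetadata.scala:111, took 5.495166 s",
]

jobs = list(parse_job_times(sample))
first, last = jobs[0], jobs[-1]
print(f"job {first[0]}: {first[2]:.3f} s, job {last[0]}: {last[2]:.3f} s")
print(f"slowdown for the same call site: {last[2] / first[2]:.1f}x")
```

Run over a full driver log and grouped by call site, the same parsing would show whether the growth in job time is steady across the run or jumps at a particular point.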