Ah, I didn't realize this was non-MLlib code. Do you mean to be sending stochasticLossHistory in the closure as well?
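For concreteness, here's a minimal sketch of the failure mode I'm asking about (the class and method names are made up for illustration, not taken from your patch):

    import org.apache.spark.rdd.RDD
    import scala.collection.mutable.ArrayBuffer

    class GroupedOptimizer extends Serializable {
      // Accumulated loss values; grows with every iteration.
      val stochasticLossHistory = ArrayBuffer[Double]()

      def step(data: RDD[Double]): Double = {
        // Referencing the field inside the closure really means
        // 'this.stochasticLossHistory', so the whole instance --
        // history included -- gets serialized into every task.
        data.map(x => x * stochasticLossHistory.length).sum()
      }

      def stepLocal(data: RDD[Double]): Double = {
        // Copying what the task needs into a local val keeps 'this'
        // out of the closure, so the task stays small.
        val historyLength = stochasticLossHistory.length
        data.map(x => x * historyLength).sum()
      }
    }

If the history is a field on the object whose method builds the closure, the serialized task would grow with every iteration, which would match task sizes creeping up over stages.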
On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott <kellr...@soe.ucsc.edu> wrote:
> It uses the standard SquaredL2Updater, and I also tried to broadcast it
> as well.
>
> The input is an RDD created by taking the union of several inputs that
> have all been run against MLUtils.kFold to produce even more RDDs. I'm
> running with 10 different inputs, each with 10 kFolds. I'm pretty certain
> that all of the input RDDs have clean closures. But I'm curious, is there
> a high overhead for running union? Could that create larger task sizes?
>
> Kyle
>
>
> On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> I also did a quick glance through the code and couldn't find anything
>> worrying that should be included in the task closures. The only possibly
>> unsanitary part is the Updater you pass in -- what is your Updater, and
>> is it possible it's dragging in a significant amount of extra state?
>>
>>
>> On Sat, Jul 12, 2014 at 7:27 PM, Kyle Ellrott <kellr...@soe.ucsc.edu> wrote:
>>
>>> I'm working on a patch to MLlib that allows for multiplexing several
>>> different model optimizations using the same RDD (SPARK-2372:
>>> https://issues.apache.org/jira/browse/SPARK-2372).
>>>
>>> In testing larger datasets, I've started to see some memory errors
>>> (java.lang.OutOfMemoryError and "exceeds max allowed:
>>> spark.akka.frameSize" errors).
>>> My main clue is that Spark will start logging warnings on smaller
>>> systems like:
>>>
>>> 14/07/12 19:14:46 WARN scheduler.TaskSetManager: Stage 2862 contains a
>>> task of very large size (10119 KB). The maximum recommended task size
>>> is 100 KB.
>>>
>>> Looking up stage '2862' leads to a 'sample at
>>> GroupedGradientDescent.scala:156' call. That code can be seen at
>>> https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L156
>>>
>>> I've looked over the code, I'm broadcasting the larger variables, and
>>> between the sampler and the combineByKey, I wouldn't think there's much
>>> data being moved over the network, much less a 10MB chunk.
>>>
>>> Any ideas of what this might be a symptom of?
>>>
>>> Kyle
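For reference, a minimal sketch of the union-of-kFold-splits input construction described above, keying each fold by a unique (input, fold) id -- the function name, key scheme, and numbers are illustrative, not from the actual patch:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    def buildFolds(sc: SparkContext,
                   inputs: Seq[RDD[LabeledPoint]]): RDD[(Int, LabeledPoint)] = {
      val keyedFolds =
        for {
          (input, i) <- inputs.zipWithIndex
          // kFold returns (training, validation) pairs; key each
          // training split by which input and fold it came from.
          ((training, _), j) <- MLUtils.kFold(input, numFolds = 10, seed = 42)
                                       .zipWithIndex.toSeq
        } yield training.map(point => (i * 10 + j, point))
      // union is a narrow, metadata-only operation: it doesn't shuffle
      // data, but downstream stages do get one task per partition of
      // every constituent RDD.
      sc.union(keyedFolds)
    }

union itself shouldn't inflate individual task sizes; a large task almost always means a large serialized closure, which is why the Updater and any captured fields (like stochasticLossHistory above) are the first things to check.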