It uses the standard SquaredL2Updater, and I also tried broadcasting it.
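
To be concrete, the broadcast attempt looks roughly like this (a simplified
sketch with made-up names, not the exact code in the patch):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.optimization.{SquaredL2Updater, Updater}

// Sketch only -- not the exact code from the patch.
def broadcastUpdater(sc: SparkContext): Broadcast[Updater] = {
  val updater: Updater = new SquaredL2Updater()
  val updaterBC = sc.broadcast(updater)
  // The task closures then reference updaterBC.value instead of
  // capturing the updater object directly.
  updaterBC
}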

The input is an RDD created by taking the union of several inputs, each of
which has been run through MLUtils.kFold to produce even more RDDs. If I
run with 10 different inputs, each split into 10 kFolds, that's on the
order of 100 RDDs going into the union. I'm pretty certain that all of the
input RDDs have clean closures. But I'm curious: is there a high overhead
for running union? Could that create larger task sizes?
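
For reference, the input is built roughly like this (a simplified sketch;
the names buildInput/inputs are made up, and the real patch also keys each
fold so the grouped optimizer can tell the models apart):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Sketch only: illustrative names, not the actual patch code.
def buildInput(sc: SparkContext, inputs: Seq[RDD[LabeledPoint]]): RDD[LabeledPoint] = {
  // Split every input into k folds and keep the training half of each pair.
  val folds = inputs.flatMap { rdd =>
    MLUtils.kFold(rdd, numFolds = 10, seed = 42).map(_._1)
  }
  // 10 inputs x 10 folds = ~100 RDDs unioned into one.
  sc.union(folds)
}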

Kyle



On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> I also did a quick glance through the code and couldn't find anything
> worrying that should be included in the task closures. The only possibly
> unsanitary part is the Updater you pass in -- what is your Updater and is
> it possible it's dragging in a significant amount of extra state?
>
>
> On Sat, Jul 12, 2014 at 7:27 PM, Kyle Ellrott <kellr...@soe.ucsc.edu>
> wrote:
>
>> I'm working on a patch to MLlib that allows for multiplexing several
>> different model optimizations using the same RDD ( SPARK-2372:
>> https://issues.apache.org/jira/browse/SPARK-2372 )
>>
>> While testing larger datasets, I've started to see some memory errors
>> (java.lang.OutOfMemoryError and "exceeds max allowed: spark.akka.frameSize"
>> errors).
>> My main clue is that Spark starts logging warnings on smaller systems
>> like:
>>
>> 14/07/12 19:14:46 WARN scheduler.TaskSetManager: Stage 2862 contains a
>> task of very large size (10119 KB). The maximum recommended task size is
>> 100 KB.
>>
>> Looking up stage '2862' leads to a 'sample at
>> GroupedGradientDescent.scala:156' call. That code can be seen at
>>
>> https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L156
>>
>> I've looked over the code; I'm broadcasting the larger variables, and
>> between the sampler and the combineByKey I wouldn't think there would be
>> much data moving over the network, much less a 10 MB chunk.
>>
>> Any ideas of what this might be a symptom of?
>>
>> Kyle
>>
>>
>
