Ah, I didn't realize this was non-MLlib code. Do you mean to be sending
stochasticLossHistory in the closure as well?
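
A minimal sketch of what I mean (the field name matches yours, but the
surrounding class is hypothetical): referencing a member field inside an
RDD operation compiles to a reference through `this`, so the whole
enclosing instance, history and all, is serialized into every task
closure.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable.ArrayBuffer

    class GradientRunner extends Serializable {
      // Driver-side loss history, appended to after each iteration.
      val stochasticLossHistory = new ArrayBuffer[Double]()

      def badSum(data: RDD[Double]): Double =
        // Referencing the field captures `this`, shipping the whole
        // GradientRunner (history included) with every task.
        data.map(x => x + stochasticLossHistory.size).sum()

      def goodSum(data: RDD[Double]): Double = {
        // Copy just what the tasks need into a local val first, so
        // only that small value rides in the closure.
        val historySize = stochasticLossHistory.size
        data.map(x => x + historySize).sum()
      }
    }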


On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott <kellr...@soe.ucsc.edu> wrote:

> It uses the standard SquaredL2Updater, and I also tried broadcasting it.
>
> The input is an RDD created by taking the union of several inputs that
> have each been run through MLUtils.kFold to produce even more RDDs; for
> example, I run with 10 different inputs, each split into 10 kFolds. I'm
> pretty certain that all of the input RDDs have clean closures. But I'm
> curious: is there a high overhead for running union? Could that create
> larger task sizes?
>
> Kyle
>
>
>
> On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson <ilike...@gmail.com>
> wrote:
>
>> I also took a quick glance through the code and couldn't find anything
>> worrying that would be pulled into the task closures. The only possibly
>> unsanitary part is the Updater you pass in -- what is your Updater, and
>> is it possible it's dragging in a significant amount of extra state?
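>>
>> To illustrate what I mean by unsanitary (a hypothetical example, not
>> your code): an updater that holds a reference to large driver-side
>> state ships that state with every task, while the stock
>> SquaredL2Updater is stateless.
>>
>>     import org.apache.spark.mllib.linalg.Vector
>>     import org.apache.spark.mllib.optimization.{SquaredL2Updater, Updater}
>>
>>     // Fine: the stock updater is a stateless top-level class.
>>     val clean: Updater = new SquaredL2Updater
>>
>>     // Risky: `table` becomes a field of the updater, so roughly 80MB
>>     // is serialized with it into every task closure that uses it.
>>     class TableUpdater(table: Array[Double]) extends SquaredL2Updater {
>>       override def compute(weightsOld: Vector, gradient: Vector,
>>           stepSize: Double, iter: Int, regParam: Double) = {
>>         require(table.length > 0)  // touch the field so it is retained
>>         super.compute(weightsOld, gradient, stepSize, iter, regParam)
>>       }
>>     }
>>     val unsanitary: Updater = new TableUpdater(new Array[Double](10000000))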
>>
>>
>> On Sat, Jul 12, 2014 at 7:27 PM, Kyle Ellrott <kellr...@soe.ucsc.edu>
>> wrote:
>>
>>> I'm working on a patch to MLlib that allows for multiplexing several
>>> different model optimizations over the same RDD ( SPARK-2372:
>>> https://issues.apache.org/jira/browse/SPARK-2372 )
>>>
>>> In testing larger datasets, I've started to see some memory errors (
>>> java.lang.OutOfMemoryError and "exceeds max allowed: spark.akka.frameSize"
>>> errors ).
>>> My main clue is that, on smaller systems, Spark starts logging
>>> warnings like:
>>>
>>> 14/07/12 19:14:46 WARN scheduler.TaskSetManager: Stage 2862 contains a
>>> task of very large size (10119 KB). The maximum recommended task size is
>>> 100 KB.
>>>
>>> Looking up stage '2862' in the logs leads to a 'sample at
>>> GroupedGradientDescent.scala:156' call. That code can be seen at
>>>
>>> https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L156
>>>
>>> I've looked over the code; I'm broadcasting the larger variables, and
>>> between the sampler and the combineByKey I wouldn't think there'd be
>>> much data being moved over the network, much less a 10MB chunk.
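>>>
>>> For reference, the broadcast pattern I'm using looks roughly like
>>> this (a self-contained sketch with stand-in names and data, not the
>>> real code):
>>>
>>>     import org.apache.spark.SparkContext
>>>     import org.apache.spark.SparkContext._
>>>
>>>     val sc = new SparkContext("local", "broadcast-sketch")
>>>
>>>     // Stand-in for the per-model weight state.
>>>     val weights: Map[Int, Double] = (0 until 10).map(k => (k, 0.0)).toMap
>>>     // Tasks capture only the small broadcast handle, not the map.
>>>     val bcWeights = sc.broadcast(weights)
>>>
>>>     val data = sc.parallelize(Seq.tabulate(1000)(i => (i % 10, i.toDouble)))
>>>     val sums = data.sample(false, 0.1, 42)
>>>       .map { case (modelKey, x) => (modelKey, x - bcWeights.value(modelKey)) }
>>>       .combineByKey(
>>>         (v: Double) => (v, 1L),
>>>         (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),
>>>         (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2))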
>>>
>>> Any ideas of what this might be a symptom of?
>>>
>>> Kyle
>>>
>>>
>>
>
