Yes, this is a proposed patch to MLLib that lets you use one RDD to train
multiple models at the same time. I'm hoping that multiplexing several
models in the same RDD will be more efficient than trying to get the Spark
scheduler to manage a few hundred tasks simultaneously.
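
For context, the rough shape of the idea (the names here are illustrative,
not the exact patch API) is to key every training example by the ID of the
model it belongs to, so a single RDD can drive many optimizations at once:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Tag each example with its model's ID and union everything into one
    // RDD; downstream *ByKey operations then work per-model.
    def multiplex(inputs: Seq[RDD[LabeledPoint]]): RDD[(Int, LabeledPoint)] = {
      val tagged = inputs.zipWithIndex.map { case (rdd, modelId) =>
        rdd.map(point => (modelId, point))
      }
      tagged.reduce(_ union _)
    }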

I don't think stochasticLossHistory is being included in the closure
(please correct me if I'm wrong). It's used once on line 183 to capture the
loss sums (a local operation on the results of a 'collect' call), and again
on line 198 to update weightSet, but that's after the loop completes, and
the memory blowup definitely happens before then.
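
To illustrate, the pattern in that part of the loop is roughly this
(simplified, with approximate names):

    // The per-model loss sums come back to the driver via collect(), so
    // appending to stochasticLossHistory is a purely local operation --
    // the buffer never has to be serialized into a task closure.
    val lossSums: Array[(Int, Double)] = modelLosses.collect()
    lossSums.foreach { case (modelId, loss) =>
      stochasticLossHistory(modelId) += loss
    }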

Kyle



On Tue, Jul 15, 2014 at 12:00 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Ah, I didn't realize this was non-MLLib code. Do you mean to be sending
> stochasticLossHistory in the closure as well?
>
>
> On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott <kellr...@soe.ucsc.edu>
> wrote:
>
>> It uses the standard SquaredL2Updater, and I also tried broadcasting it.
>>
>> The input is an RDD created by taking the union of several inputs that
>> have each been run through MLUtils.kFold to produce even more RDDs; if I
>> run with 10 different inputs, each with 10 kFolds, that's 100 RDDs going
>> into the union. I'm pretty certain that all of the input RDDs have clean
>> closures. But I'm curious: is there a high overhead for running union?
>> Could that create larger task sizes?
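>>
>> For reference, the construction looks roughly like this (simplified):
>>
>>   import org.apache.spark.mllib.util.MLUtils
>>
>>   // Each source RDD is split into (train, validation) folds, and all
>>   // the training folds are unioned into the single input RDD.
>>   val folds = inputs.flatMap { rdd =>
>>     MLUtils.kFold(rdd, numFolds = 10, seed = 42).map(_._1)
>>   }
>>   val combined = sc.union(folds)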
>>
>> Kyle
>>
>>
>>
>> On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson <ilike...@gmail.com>
>> wrote:
>>
>>> I also did a quick glance through the code and couldn't find anything
>>> worrying that would be pulled into the task closures. The only possibly
>>> unsanitary part is the Updater you pass in -- what is your Updater, and is
>>> it possible it's dragging in a significant amount of extra state?
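>>>
>>> For a contrived example of that trap (not a claim about your code): an
>>> updater defined as an anonymous class inside another class can pick up a
>>> reference to its enclosing instance.
>>>
>>>   import org.apache.spark.mllib.linalg.Vector
>>>   import org.apache.spark.mllib.optimization.SquaredL2Updater
>>>
>>>   class Driver {
>>>     val hugeLookupTable = Array.fill(10000000)(0.0)  // driver-side state
>>>
>>>     // Referencing outer state from the anonymous subclass gives it an
>>>     // $outer pointer, so serializing this updater drags the whole
>>>     // Driver instance -- hugeLookupTable included -- into every task.
>>>     val updater = new SquaredL2Updater {
>>>       override def compute(weightsOld: Vector, gradient: Vector,
>>>                            stepSize: Double, iter: Int, regParam: Double) = {
>>>         val touch = hugeLookupTable(0)  // forces the outer capture
>>>         super.compute(weightsOld, gradient, stepSize, iter, regParam)
>>>       }
>>>     }
>>>   }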
>>>
>>>
>>> On Sat, Jul 12, 2014 at 7:27 PM, Kyle Ellrott <kellr...@soe.ucsc.edu>
>>> wrote:
>>>
>>>> I'm working on a patch to MLLib that allows for multiplexing several
>>>> different model optimizations using the same RDD ( SPARK-2372:
>>>> https://issues.apache.org/jira/browse/SPARK-2372 )
>>>>
>>>> While testing larger datasets, I've started to see memory errors
>>>> ( java.lang.OutOfMemoryError and "exceeds max allowed:
>>>> spark.akka.frameSize" errors ). My main clue is that on smaller systems
>>>> Spark starts logging warnings like:
>>>>
>>>> 14/07/12 19:14:46 WARN scheduler.TaskSetManager: Stage 2862 contains a
>>>> task of very large size (10119 KB). The maximum recommended task size is
>>>> 100 KB.
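>>>>
>>>> I know I can paper over the frameSize errors by raising the limit, e.g.:
>>>>
>>>>   import org.apache.spark.SparkConf
>>>>
>>>>   val conf = new SparkConf()
>>>>     .set("spark.akka.frameSize", "128")  // value in MB; default is 10
>>>>
>>>> but that just hides whatever is making the tasks so large.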
>>>>
>>>> Looking up stage '2862' leads to a 'sample at
>>>> GroupedGradientDescent.scala:156' call. That code can be seen at
>>>>
>>>> https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L156
>>>>
>>>> I've looked over the code: I'm broadcasting the larger variables, and
>>>> between the sampler and the combineByKey I wouldn't think there would be
>>>> much data moving over the network, much less a 10MB chunk.
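>>>>
>>>> For the record, the pattern is roughly this (the names approximate
>>>> what's in the patch):
>>>>
>>>>   import org.apache.spark.SparkContext._  // pair-RDD functions
>>>>
>>>>   // weightSet maps modelId -> current weights; broadcasting it means
>>>>   // each task closure carries only the small Broadcast handle rather
>>>>   // than every model's weight vector.
>>>>   val weightsBC = sc.broadcast(weightSet)
>>>>
>>>>   val lossSums = data
>>>>     .sample(false, miniBatchFraction, seed)  // withReplacement = false
>>>>     .map { case (modelId, point) =>
>>>>       val (_, loss) = gradient.compute(point.features, point.label,
>>>>                                        weightsBC.value(modelId))
>>>>       (modelId, loss)
>>>>     }
>>>>     .reduceByKey(_ + _)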
>>>>
>>>> Any ideas of what this might be a symptom of?
>>>>
>>>> Kyle
>>>>
>>>>
>>>
>>
>
