foreach is an action, from the source code you can see that it call runJob
method. In spark, it is difficult to change data in place, for it has a
functional semantic.

I think "mapPartitions" is more suitable for machine learning algorithms. I
am writing a LDA for mllib, you can have a look if you like, but not very
deep optimized yet. I will do more extra work to optimize it.

https://github.com/yinxusen/incubator-spark/blob/lda-mahout/mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala



2014/1/24 guojc <guoj...@gmail.com>

> Yes, I means Gibbs sampling. From the api document, I don't see why the
> data will be collected to driver. The document say that '
> def foreach(f: (T) => Unit): Unit
> Applies a function f to all elements of this RDD.'
>
> So If I want to change my data in place, what operation I should use?
>
> Best Regards,
> Jiacheng Guo
>
>
> On Fri, Jan 24, 2014 at 9:03 PM, 尹绪森 <yinxu...@gmail.com> wrote:
>
>> Do you mean "Gibbs sampling" ? Actually, foreach is an action, it will
>> collect all data from workers to driver. You will get OOM complained by JVM.
>>
>> I am not very sure of your implementation, but if data not need to join
>> together, you'd better keep them in workers.
>>
>>
>> 2014/1/24 guojc <guoj...@gmail.com>
>>
>>> Hi,
>>>    I'm writing a paralell mcmc program that having a very large dataset
>>> in memory, and need to update the dataset in-memory and avoid creating
>>> additional copy. Should I choose a foreach operation on rdd to express the
>>> change? or I have to create a new rdd after each sampling process?
>>>
>>> Thanks,
>>> Jiacheng Guo
>>>
>>
>>
>>
>> --
>> Best Regards
>> -----------------------------------
>> Xusen Yin    尹绪森
>> Beijing Key Laboratory of Intelligent Telecommunications Software and
>> Multimedia
>> Beijing University of Posts & Telecommunications
>> Intel Labs China
>> Homepage: *http://yinxusen.github.io/ <http://yinxusen.github.io/>*
>>
>
>


-- 
Best Regards
-----------------------------------
Xusen Yin    尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and
Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: *http://yinxusen.github.io/ <http://yinxusen.github.io/>*

Reply via email to