foreach is an action; from the source code you can see that it calls the runJob method, so the function runs on the workers and the data is not collected to the driver. In Spark it is difficult to change data in place, because it has functional semantics.
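Because data cannot easily be changed in place, a sampling pass is normally written as a transformation that yields a new RDD. Below is a minimal sketch of that idea. It is not taken from any existing Spark code: the `Token` record, the `resamplePartition` helper, and the uniform random draw are all hypothetical placeholders (a real Gibbs sampler would draw from the conditional distribution). The helper has the `Iterator => Iterator` shape that `mapPartitions` accepts, and is exercised here on a local iterator so it runs without a cluster.

```scala
import scala.util.Random

object ResampleSketch {
  // Hypothetical record: a word id with its current topic assignment.
  final case class Token(word: Int, topic: Int)

  // Per-partition step: one RNG per partition; returns NEW records
  // instead of mutating the input ones.
  // Placeholder update: a uniform draw, not a real Gibbs conditional.
  def resamplePartition(numTopics: Int, seed: Long)(part: Iterator[Token]): Iterator[Token] = {
    val rng = new Random(seed)
    part.map(t => t.copy(topic = rng.nextInt(numTopics)))
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the contents of one partition.
    val partition = Seq(Token(0, 0), Token(1, 0), Token(2, 0))
    val updated = resamplePartition(numTopics = 4, seed = 42L)(partition.iterator).toList
    println(updated)

    // With Spark on the classpath and tokens: RDD[Token] (assumed), the same
    // function could be applied per partition, with a different seed each:
    //   val next = tokens.mapPartitionsWithIndex((i, it) => resamplePartition(4, 42L + i)(it))
    // `next` is a new RDD; `tokens` itself is left unchanged.
  }
}
```

Each iteration would then replace the previous RDD with the new one (caching it if it is reused), rather than updating records inside a foreach.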
I think "mapPartitions" is more suitable for machine learning algorithms. I am writing an LDA implementation for MLlib; you can have a look if you like, but it is not deeply optimized yet. I will do more work to optimize it.

https://github.com/yinxusen/incubator-spark/blob/lda-mahout/mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala

2014/1/24 guojc <guoj...@gmail.com>

> Yes, I mean Gibbs sampling. From the API documentation, I don't see why the
> data would be collected to the driver. The documentation says:
> 'def foreach(f: (T) => Unit): Unit
> Applies a function f to all elements of this RDD.'
>
> So if I want to change my data in place, what operation should I use?
>
> Best Regards,
> Jiacheng Guo
>
>
> On Fri, Jan 24, 2014 at 9:03 PM, 尹绪森 <yinxu...@gmail.com> wrote:
>
>> Do you mean "Gibbs sampling"? Actually, foreach is an action; it will
>> collect all data from the workers to the driver. You will get an OOM error from the JVM.
>>
>> I am not very sure about your implementation, but if the data does not need to be
>> joined together, you had better keep it on the workers.
>>
>>
>> 2014/1/24 guojc <guoj...@gmail.com>
>>
>>> Hi,
>>> I'm writing a parallel MCMC program that has a very large dataset
>>> in memory, and I need to update the dataset in memory and avoid creating
>>> an additional copy. Should I choose a foreach operation on the RDD to express the
>>> change, or do I have to create a new RDD after each sampling process?
>>>
>>> Thanks,
>>> Jiacheng Guo
>>>
>>
>> --
>> Best Regards
>> -----------------------------------
>> Xusen Yin 尹绪森
>> Beijing Key Laboratory of Intelligent Telecommunications Software and
>> Multimedia
>> Beijing University of Posts & Telecommunications
>> Intel Labs China
>> Homepage: http://yinxusen.github.io/
>>
>

--
Best Regards
-----------------------------------
Xusen Yin 尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/