Sent from my iPad
> On Nov 24, 2014, at 9:41 AM, zh8788 <78343...@qq.com> wrote:
>
> Hi,
>
> I am new to Spark. This is the first time I am posting here. Currently, I
> am trying to implement the ADMM optimization algorithm for Lasso/SVM, and
> I have run into a problem:
>
> Since the training data (label, feature) is large, I created an RDD and
> cached the training data in memory. ADMM then needs to keep local
> parameters (u, v), which are different for each partition. In each
> iteration, I need to use the training data (only the part on that
> partition) together with u and v to calculate new values for u and v.

RDD has a transformation named mapPartitions(); it runs separately on each
partition of the RDD.

> Question 1:
>
> One way is to zip (training data, u, v) into one RDD and update it in each
> iteration. But as we can see, the training data is large and does not
> change over the whole run; only u and v (which are small) change in each
> iteration. If I zip these three together, I cannot cache that RDD (since
> it changes every iteration). But if I do not cache it, the training data
> has to be re-read every iteration. How can I avoid that?
>
> Question 2:
>
> Related to Question 1: the online documentation says that if we do not
> cache an RDD, it will not be kept in memory. RDDs are also evaluated
> lazily, so I am confused about when a previously computed RDD is still in
> memory.
>
> Case 1:
>
> B = A.map(function1)
> B.collect()  # This forces B to be computed? After that, the node just
>              # releases B since it is not cached?
> D = B.map(function3)
> D.collect()
>
> Case 2:
>
> B = A.map(function1)
> D = B.map(function3)
> D.collect()
>
> Case 3:
>
> B = A.map(function1)
> C = A.map(function2)
> D = B.map(function3)
> D.collect()
>
> In which case is B in memory on each node when I calculate D?

In none of them: an unpersisted RDD is recomputed from its lineage each
time an action needs it. If you want a certain RDD stored in memory, use
rdd.persist(StorageLevel.MEMORY_ONLY). Spark automatically monitors cache
usage on each node and drops old data partitions in a least-recently-used
(LRU) fashion.
> Question 3:
>
> Can I use a function to do operations on two RDDs?

Yes, but such a function can only be executed on the driver; you cannot
reference one RDD from inside another RDD's closure.

> E.g. Function newfun(rdd1, rdd2)
> # rdd1 is large and does not change over the whole run (training data),
> # so I can cache it
> # rdd2 is small and changes in each iteration (u, v)
>
> Question 4:
>
> Are there other ways to solve this kind of problem? I think this is a
> common problem, but I could not find any good solutions.
>
> Thanks a lot
>
> Han
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-a-local-variable-in-each-cluster-tp19604.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------