Hi, I am new to Spark, and this is my first post here. I am trying to implement ADMM optimization algorithms for Lasso/SVM, and I have run into a problem:
The training data (label, feature) is large, so I created an RDD and cached it in memory. ADMM also needs to keep local parameters (u, v), which are different for each partition. In each iteration, I need to use the training data (only the part on that partition) together with u and v to calculate the new values of u and v.

Question 1: One way is to zip (training data, u, v) into a single RDD and update it in each iteration. However, the training data is large and never changes; only u and v (which are small) change in each iteration. If I zip the three together, I cannot cache that RDD, since it changes in every iteration. But if I do not cache it, I still need to reuse the training data in every iteration. How can I do that?

Question 2: Related to Question 1, the online documentation says that an RDD that is not cached will not be kept in memory. Since RDDs are evaluated lazily, I am confused about when a previously computed RDD is still in memory.

Case 1:
    B = A.map(function1)
    B.collect()    # Does this force B to be calculated? Does the node then release B, since it is not cached?
    D = B.map(function3)
    D.collect()

Case 2:
    B = A.map(function1)
    D = B.map(function3)
    D.collect()

Case 3:
    B = A.map(function1)
    C = A.map(function2)
    D = B.map(function3)
    D.collect()

In which of these cases can I assume that B is in memory on each node when D is calculated?

Question 3: Can I use a function that operates on two RDDs? E.g.

    Function newfun(rdd1, rdd2)
    # rdd1 is large and does not change for the whole run (training data), so I can cache it
    # rdd2 is small and changes in each iteration (u, v)

Question 4: Are there other ways to solve this kind of problem? I think this is a common problem, but I could not find any good solutions.

Thanks a lot,
Han

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-a-local-variable-in-each-cluster-tp19604.html
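To make the setup in Question 1 concrete, here is a minimal sketch of the data flow in plain Python, without Spark. Each inner list stands in for one partition of the training data, and `update_local`, `run`, and the update rule itself are hypothetical placeholders, not the real ADMM step for Lasso or SVM. The point is only that the data stays fixed while a small (u, v) pair per partition is replaced in every iteration.

```python
# Plain-Python model of the per-partition update; no Spark is involved.
# Each inner list in `partitions` stands in for one partition of the training data.

def update_local(partition, u, v):
    """Hypothetical stand-in for one local ADMM step (NOT the real Lasso/SVM update)."""
    # partition: list of (label, features) pairs that live on one partition
    mean_label = sum(label for label, _ in partition) / len(partition)
    new_u = u + (mean_label - v)       # toy rule, for illustration only
    new_v = 0.5 * (v + mean_label)     # toy rule, for illustration only
    return new_u, new_v

def run(partitions, num_iters):
    # One small (u, v) pair per partition; the training data itself is never modified.
    state = [(0.0, 0.0) for _ in partitions]
    for _ in range(num_iters):
        # Pair each fixed partition with its current (u, v) and compute the new pair.
        state = [update_local(part, u, v)
                 for part, (u, v) in zip(partitions, state)]
    return state

partitions = [
    [(1.0, [0.1, 0.2]), (3.0, [0.3, 0.4])],   # "partition 0"
    [(-1.0, [0.5, 0.6])],                     # "partition 1"
]
state = run(partitions, 3)
```

In Spark terms this would roughly correspond to caching the training RDD once and pairing it with a small state RDD (for example via `zip`) in each iteration, though whether that is the right approach is exactly what the questions above ask.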