Hello everybody, I am wondering how Spark handles its RDDs via HDFS. What happens if, during a map phase, I need data that are not present locally?
What I am working on: a recommendation algorithm, Matrix Factorization (MF), using stochastic gradient descent (SGD) as the optimizer. For now my algorithm works locally, but to anticipate further needs I would like to parallelize it using Spark 0.9.0 on HDFS (without YARN). I saw the logistic regression (LR) SGD example in MLlib. Matrix factorization can be viewed as several logistic regression iterations, so I will follow that example to implement it. The only difference is that my dataset is composed of 3 files:

User.csv -> (UserID, age, sex, ...)
Item.csv -> (ItemID, color, size, ...)
Obs.csv  -> (UserID, ItemID, rating)

What I understand: in the LR example we have only the 'Obs.csv' file. Given that we have 3 machines, the file is divided across the 3 machines; during the map phase, the LR algorithm runs on each of the 3 slaves against its local data, so each LR instance processes 1/3 of the data. During the reduce phase, we just average the results returned by each slave. No network communication is needed during the LR computation except for the reduce step: all the data used during the map phase are local.

What I am wondering: in my case, the MF algorithm on each machine needs ALL of the 'User.csv' file, plus 1/3 of 'Item.csv' and 1/3 of 'Obs.csv', to operate. When HDFS distributes my 3 files, I will have 1/3 of each file on each datanode.

1. What will happen when my MF algorithm is executed on each node?
2. Network communication will be needed for the other 2/3 of User.csv, right?
3. Will that network communication be optimized as follows: while one computation is running, will the data needed for the next computation be loaded in the background (so that the network transfer time does not add to the computation time)?

Any help is highly appreciated.

Best regards,
Germain Tanguy.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-handle-RDD-via-HDFS-tp4003.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
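P.S. To make sure I have the map/reduce pattern right, here is a minimal sketch (plain Python, no Spark, toy linear data standing in for one factor update) of what I believe the LR example does: each partition runs SGD locally on its own slice, and the reduce step averages the per-partition parameters. All names here (`run_local_sgd`, `partitions`, ...) are illustrative, not Spark API.

```python
import random

random.seed(0)

# Toy data: rating = W_TRUE * feature (noiseless, for illustration).
W_TRUE = 3.0
data = [(x, W_TRUE * x) for x in [random.uniform(-1, 1) for _ in range(300)]]

# "HDFS" splits the file into 3 blocks -> 3 partitions, one per machine.
partitions = [data[0::3], data[1::3], data[2::3]]

def run_local_sgd(part, lr=0.1, epochs=50):
    """Map phase: plain SGD using only this partition's (local) data."""
    w = 0.0
    for _ in range(epochs):
        for x, y in part:
            grad = (w * x - y) * x     # gradient of 0.5 * (w*x - y)^2
            w -= lr * grad
    return w

# Map: one locally trained parameter per partition.
local_ws = [run_local_sgd(p) for p in partitions]

# Reduce: average the parameters returned by the 3 "slaves".
w_avg = sum(local_ws) / len(local_ws)
```

The only cross-machine traffic in this pattern is the tiny parameter vector exchanged in the reduce step, which matches my understanding of the LR example.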
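P.P.S. Regarding question 2, if I understand the docs correctly, Spark's answer to "every node needs all of User.csv" is a broadcast variable (`sc.broadcast(...)`): the small table is shipped once to every worker instead of being fetched piecemeal over the network. Here is a sketch of that idea in plain Python (a simulation, not Spark API; the names `map_task`, `obs_partitions` etc. are made up):

```python
# Full user table -- small enough to replicate ("broadcast") to every worker.
users = {1: {"age": 25}, 2: {"age": 31}, 3: {"age": 47}}

# Observations (UserID, ItemID, rating), split into 3 partitions by HDFS.
obs_partitions = [
    [(1, 10, 4.0), (2, 11, 3.0)],
    [(3, 12, 5.0), (1, 13, 2.0)],
    [(2, 14, 4.5), (3, 15, 1.0)],
]

def map_task(obs_part, users_broadcast):
    """Runs on one worker: joins its local observations against the
    replicated user table -- no per-record network lookups needed."""
    return [(uid, iid, rating, users_broadcast[uid]["age"])
            for uid, iid, rating in obs_part]

# Every map task receives the same replicated copy of `users`.
joined = [row for part in obs_partitions for row in map_task(part, users)]
```

So the 2/3 of User.csv would cross the network once at broadcast time, rather than repeatedly during the map phase — is that the right way to think about it?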