Hello everybody,

I am wondering how Spark handles its RDDs on top of HDFS: what happens if,
during a map phase, I need data that is not present locally?

What I am working on :
I am working on a recommendation algorithm: matrix factorization (MF) using
stochastic gradient descent as the optimizer. For now my algorithm works
locally, but to anticipate future needs I would like to parallelize it using
Spark 0.9.0 on HDFS (without YARN).
I saw the logistic regression (LR) SGD example in MLlib. Matrix
factorization can be viewed as multiple logistic regression iterations, so I
will follow that example to implement it. The only difference is that my
dataset is composed of 3 files:
        User.csv -> (UserID age sex..) 
        Item.csv -> (ItemID color size..) 
        Obs.csv -> (UserID, ItemID, ratings)

What I understand :
In the LR example we have only the 'Obs.csv' file. Given that we have 3
machines, the file will be split across them, and during the map phase the
LR algorithm runs on each of the 3 slaves against its local data, so each
one processes 1/3 of the data. During the reduce phase, we simply average
the results returned by the slaves. No network communication is needed
during the LR process except for the reduce step; all the data used during
the map phase is local.
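For what it's worth, the pattern I describe above can be sketched in plain
Python (no Spark, so the map/reduce steps are just function calls). The
linear model, squared loss, and learning rate below are illustrative
assumptions on my part, not the actual MLlib code:

```python
# Sketch of the map/reduce scheme: each of the 3 "slaves" runs SGD on its
# local 1/3 of the observations (map), and the reduce step averages the
# resulting models. No communication happens during the map phase.

def local_sgd(partition, w, lr=0.1, epochs=5):
    """Run SGD on one partition's local data only (the 'map' side)."""
    w = list(w)
    for _ in range(epochs):
        for x, y in partition:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Toy noiseless dataset y = 2*x + 1, split into 3 partitions the way HDFS
# would split a file into blocks.
data = [([0.1 * i, 1.0], 2.0 * (0.1 * i) + 1.0) for i in range(30)]
partitions = [data[0:10], data[10:20], data[20:30]]

# Map: independent local SGD per partition, all data local.
models = [local_sgd(p, [0.0, 0.0]) for p in partitions]

# Reduce: average the 3 models (the only communication step).
w_avg = [sum(ws) / len(models) for ws in zip(*models)]
print(w_avg)
```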

What I am wondering : 
In my case, MF needs, on each machine, all of the 'User.csv' file, 1/3 of
the 'Item.csv' file, and 1/3 of 'Obs.csv' to operate. When HDFS distributes
my 3 files, I will have 1/3 of each file on each datanode.
1. What will happen when my MF algorithm is executed on each node?
2. Network communication will be needed for the other 2/3 of User.csv,
right?
3. Will network communication be optimized as follows: during a computation,
will the data needed for the next computation be prefetched, so that the
time taken by network communication does not add to the computation time?
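To make the access pattern I mean concrete, here is a plain-Python sketch
(no Spark). Shipping the whole, usually small, user table to every worker
is what Spark exposes as a broadcast variable (sc.broadcast(...)); with that
in hand every user lookup becomes local. The file contents and the
process_partition helper below are made up for illustration:

```python
# Full User.csv, replicated to every worker (a broadcast variable in Spark).
users = {1: ("25", "F"), 2: ("31", "M"), 3: ("19", "F")}
# Item.csv; shown in full here for simplicity, though in my scenario each
# worker would hold only its 1/3 shard.
items = {10: ("red", "S"), 11: ("blue", "M"), 12: ("green", "L")}

# Obs.csv split into 3 partitions, as HDFS block placement would do.
obs_partitions = [
    [(1, 10, 4.0), (3, 11, 2.0)],
    [(2, 10, 5.0), (1, 12, 3.0)],
    [(3, 12, 1.0), (2, 11, 4.0)],
]

def process_partition(obs, users_bcast, item_table):
    """Join each local observation against the replicated user table.

    Because users_bcast is available on every worker, no network traffic
    is needed per lookup."""
    return [(users_bcast[u], item_table.get(i), r) for u, i, r in obs]

results = [process_partition(p, users, items) for p in obs_partitions]
```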

Any help is highly appreciated. 

Best regards,

Germain Tanguy.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-handle-RDD-via-HDFS-tp4003.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
