Hi,

Please have a look at the implementation for the RBM here<http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm>. Following the last discussion, where a user-parallelized RBM was suggested as the right way to proceed, I have worked towards that. One issue which Jake predicted would arise is the distribution of the bias and weight matrices, and that is exactly the issue I am facing now. The original algorithm is here<http://www.cs.toronto.edu/~hinton/code/rbm.m>, and I have detailed the data structures below for ease of reference:
  vishid[numItems][softmax][totalFeatures]       DistCache
  visbiases[numItems][softmax]                   DistCache
  hidbiases[totalFeatures]                       DistCache
  CDpos[numItems][softmax][totalFeatures]        X 0
  CDneg[numItems][softmax][totalFeatures]        X 0
  CDinc[numItems][softmax][totalFeatures]        DistCache
  poshidprobs[totalFeatures]                     X Depend on DistCache variables
  poshidstates[totalFeatures]                    X Depend on DistCache variables
  curposhidstates[totalFeatures]                 X Depend on DistCache variables
  poshidact[totalFeatures]                       X 0
  neghidact[totalFeatures]                       X 0
  neghidprobs[totalFeatures]                     X Depend on DistCache variables
  neghidstates[totalFeatures]                    X Depend on DistCache variables
  hidbiasinc[totalFeatures]                      DistCache
  nvp2[numItems][softmax]                        X 0
  negvisprobs[numItems][softmax]                 X 0
  negvissoftmax[numItems]                        X 0
  posvisact[numItems][softmax]                   X 0
  negvisact[numItems][softmax]                   X 0
  visbiasinc[numItems][softmax]                  DistCache

The elements marked *DistCache* (red in my original table) have to be distributed via *DistributedCache* or a similar mechanism to the user-oriented Mappers. The elements marked *X 0* (green) are set to 0 before the user-oriented MapReduce section begins. The elements marked *X Depend on DistCache* (green) can be populated inside the Mapper, but depend on the *DistCache*-supplied (or equivalently supplied) variables.

1. For initializing the Mapper, I saw that the LDA code was creating an LDAState and supplying it as part of the configuration, so I took that hint and put all the initial data elements into an RBMState. However, on the way to the Reducer and back to the main code, I have to do multiple things:

- Send *all* the DistCache-marked data structures in some form to the Reducer, average them across all users (generating one averaged copy of each structure), and then return them to the main code to update the in-memory *RBMState*. This updated *RBMState* is then passed out to the next map-reduce iteration. (All map-reduce iterations are mapped user-wise.)
- Send the variables *nrmse* and *prmse* back to the main code, because the stop condition for the iterations depends on them.

I am confused about how to do this. Should I create a separate Writable for *RBMState*, or one for the DistCache-marked data structures? Would chaining jobs help simplify this in any way? Is it better to initialize the mappers by supplying the DistCache-marked data structures via *DistributedCache*, or as part of *Mapper.setup(Context)*? If we can pass *RBMState* from *RBMDriver* to *RBMMapper* using *RBMMapper.setup(Context)*, then why can't we use *RBMReducer.setup(Context)* to pass the *RBMState* too? The logic of the algorithm is in *RBMMapper*.

2. Right now I am using *RBMInputMapper* and *RBMInputReducer* to read in a CSV file and load it as a *DistributedRowMatrix*. What role does the *DataModel* play in this context? I can estimatePrefs and so on using the output of *RBMDriver*. Is it safe to ignore *DataModel*?

3. Is it better to spin the core algorithm off into some other place, and then just put the Recommender interface in o.a.m.cf.taste? Future additions like stacked RBMs etc. could then go there without disturbing the *RBMRecommender* interface.

Thanks,
Sisir
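P.S. To make question 1 concrete, below is a rough, untested sketch of what I imagine the state serialization and the reducer-side averaging could look like. The class and method names are placeholders of my own, and I use plain java.io types so the snippet stands alone; the write()/readFields() pair mirrors Hadoop's Writable contract, so the same two methods would slot into a real "implements Writable" class. Only visbiases and hidbiases are shown; the other DistCache-marked structures would be handled the same way.

```java
import java.io.*;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only -- not the actual Mahout code. Field names follow
// the table above; everything else is a placeholder.
public class RBMStateSketch {
  double[][] visbiases;  // [numItems][softmax], distributed via DistCache
  double[] hidbiases;    // [totalFeatures],     distributed via DistCache

  public RBMStateSketch() {}  // no-arg constructor, as Writable requires

  public RBMStateSketch(double[][] vb, double[] hb) {
    visbiases = vb;
    hidbiases = hb;
  }

  // Mirrors Writable.write(DataOutput): dimensions first, then values.
  public void write(DataOutput out) throws IOException {
    out.writeInt(visbiases.length);
    out.writeInt(visbiases[0].length);
    for (double[] row : visbiases)
      for (double v : row) out.writeDouble(v);
    out.writeInt(hidbiases.length);
    for (double v : hidbiases) out.writeDouble(v);
  }

  // Mirrors Writable.readFields(DataInput): read back in the same order.
  public void readFields(DataInput in) throws IOException {
    int n = in.readInt(), k = in.readInt();
    visbiases = new double[n][k];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < k; j++) visbiases[i][j] = in.readDouble();
    hidbiases = new double[in.readInt()];
    for (int i = 0; i < hidbiases.length; i++) hidbiases[i] = in.readDouble();
  }

  // Element-wise mean across per-user copies -- the averaging step I picture
  // the Reducer doing before the result flows back into the in-memory state.
  static double[] average(List<double[]> perUser) {
    double[] avg = new double[perUser.get(0).length];
    for (double[] u : perUser)
      for (int i = 0; i < avg.length; i++) avg[i] += u[i] / perUser.size();
    return avg;
  }
}
```

If this shape is roughly right, the remaining question is just whether one such Writable should carry the whole *RBMState* or one per matrix.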
