Hi Sisir, the red and green didn't come through (no HTML email). Could you separate out the variables?
Before pursuing this LDA-like method, understand that it is not scalable: you keep huge vectors and matrices in memory, and they grow with the size of the data. A better technique would be to build a matrix and operate on it row by row; that would involve some ninjutsu in terms of breaking up the problem.

If you still wish to continue this way, make this into a Writable, read it in the mapper's setup(), and send it to the reducer. In the reducer, take all the RBMState values and sum them one by one (the overall state-update step); when the reducer finishes, write the result to disk and use it again in the next iteration. This is exactly what LDA does.

On Mon, Jun 28, 2010 at 12:30 AM, Sisir Koppaka <[email protected]> wrote:
> Hi,
>
> Please have a look at the implementation for the RBM here
> <http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm>.
> Following the last discussion, where a user-parallelized RBM was suggested
> as the right way to proceed, I have worked towards that. One issue which
> Jake predicted would arise is the distribution of the bias & weight
> matrices, and that is the exact issue I am having now.
> The original algorithm is here
> <http://www.cs.toronto.edu/~hinton/code/rbm.m>,
> and I have detailed the data structures below for ease:
>
> vishid[numItems][softmax][totalFeatures];   DistCache
> visbiases[numItems][softmax];               DistCache
> hidbiases[totalFeatures];                   DistCache
> CDpos[numItems][softmax][totalFeatures];    X 0
> CDneg[numItems][softmax][totalFeatures];    X 0
> CDinc[numItems][softmax][totalFeatures];    DistCache
> poshidprobs[totalFeatures];                 X Depend on DistCache variables
> poshidstates[totalFeatures];                X Depend on DistCache variables
> curposhidstates[totalFeatures];             X Depend on DistCache variables
> poshidact[totalFeatures];                   X 0
> neghidact[totalFeatures];                   X 0
> neghidprobs[totalFeatures];                 X Depend on DistCache variables
> neghidstates[totalFeatures];                X Depend on DistCache variables
> hidbiasinc[totalFeatures];                  DistCache
> nvp2[numItems][softmax];                    X 0
> negvisprobs[numItems][softmax];             X 0
> negvissoftmax[numItems];                    X 0
> posvisact[numItems][softmax];               X 0
> negvisact[numItems][softmax];               X 0
> visbiasinc[numItems][softmax];              DistCache
>
> The red-marked elements have to be distributed by *DistCache* or a similar
> mechanism to the user-oriented Mappers. The green-marked elements with
> "X 0" mean that they are set to 0 before the user-oriented MapReduce
> section begins. The green-marked elements with "X Depend on DistCache"
> imply that they can be filled in in the Mapper, but are dependent on
> *DistCache* or equivalently supplied variables.
>
> 1. For initializing the Mapper, I saw that the LDA code was creating an
> LDAState and supplying it as part of the configuration, so I took that
> hint and put all the initial data elements into an RBMState.
> However, the issue is that while going to the Reducer and returning to
> the main code, I have to do multiple things:
>
> - Send *all* the red-marked data structures in some form to the Reducer,
>   average them across all users, generating one version of all the
>   red-marked data structures, and then return them to the main code for
>   updating the in-memory *RBMState*. This updated *RBMState* is then
>   passed out to the next map-reduce iteration. (All map-reduce iterations
>   are mapped user-wise.)
> - Send the variables *nrmse* and *prmse* to the main code, because the
>   stop condition for the iterations depends on them.
>
> I am confused about how to do this. Should I create a separate Writable
> for RBMState or for the red-marked data structures? Will chaining jobs
> help in simplifying this? Is it better to initialize the mappers by
> sending in the red-marked data structures as DistributedCache, or to send
> them as part of Mapper.setup(Context)?
>
> If we can pass *RBMState* from *RBMDriver* to *RBMMapper* using
> *RBMMapper.setup(Context)*, then why can't we use
> *RBMReducer.setup(Context)* to pass the *RBMState* too?
>
> The logic of the algorithm is in *RBMMapper*.
>
> 2. Right now I am using *RBMInputMapper* and *RBMInputReducer* to read in
> a CSV file and load it as a DistributedRowMatrix. What role does the
> *DataModel* play in this context? I can estimatePrefs and so on, using
> the output of *RBMDriver*. Is it safe to ignore *DataModel*?
>
> 3. Is it better to spin off the core algorithm into some other place, and
> then just put the Recommender interface in o.a.m.cf.taste? Future
> additions like stacked RBMs etc. can go there without interrupting the
> RBMRecommender interface.
>
> Thanks,
> Sisir
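To make the "Writable plus reducer-side summing" suggestion concrete, here is a minimal sketch. It assumes RBMState holds dense double arrays (the `vishid` field below is just one illustrative member, not the full state). In a real Mahout job the class would implement org.apache.hadoop.io.Writable; here it uses plain DataOutput/DataInput so the round trip runs without Hadoop on the classpath, but the write/readFields bodies are exactly what the Writable methods would contain.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hedged sketch: in Mahout this class would "implements Writable" and be
// emitted as the reducer's value type. Only one matrix is shown; the real
// RBMState would serialize all the red-marked (DistCache) structures.
public class RBMStateSketch {
    double[][] vishid; // illustrative weight-matrix slice

    RBMStateSketch() {}
    RBMStateSketch(double[][] vishid) { this.vishid = vishid; }

    // Writable-style write(DataOutput): dimensions first, then values.
    void write(DataOutput out) throws IOException {
        out.writeInt(vishid.length);
        out.writeInt(vishid[0].length);
        for (double[] row : vishid)
            for (double v : row)
                out.writeDouble(v);
    }

    // Writable-style readFields(DataInput): mirror of write().
    void readFields(DataInput in) throws IOException {
        int rows = in.readInt(), cols = in.readInt();
        vishid = new double[rows][cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                vishid[r][c] = in.readDouble();
    }

    // Reducer-side accumulation: sum incoming states one by one.
    void plus(RBMStateSketch other) {
        for (int r = 0; r < vishid.length; r++)
            for (int c = 0; c < vishid[r].length; c++)
                vishid[r][c] += other.vishid[r][c];
    }

    public static void main(String[] args) throws IOException {
        RBMStateSketch a = new RBMStateSketch(new double[][] {{1, 2}, {3, 4}});

        // Serialize as a mapper would emit it...
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        a.write(new DataOutputStream(buf));

        // ...and deserialize as the reducer would receive it.
        RBMStateSketch b = new RBMStateSketch();
        b.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));

        b.plus(a); // the reducer repeats this for every incoming value
        System.out.println(b.vishid[1][1]); // 4 + 4
    }
}
```

After summing, dividing each entry by the number of users would give the averaging step described in the quoted mail; the reducer then writes the result to HDFS for the next iteration to pick up, which is the same pattern LDA's driver loop uses.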
