> The issue with this kind of program is that there is usually a part of the
> data that is bounded in size (the model) and a part of the data that is
> unbounded in size (the input data). The bounded portion is usually what is
> stored in distributed cache, even if not all mappers read all of that data.
>
> The unbounded part is normally then parceled out to the different data
> nodes.
>
> Why is your program the opposite? Need it be? Should it be?
It isn't the opposite. The input data is distributed. I'm not sure what you mean by model here, but if that's the model to be estimated (in the case of linear regression, the coefficients), then for the locally weighted version this doesn't make sense, as for each prediction input a different problem needs to be solved.

Maybe I should explain in detail what's happening:

For linear regression, we have a matrix X that contains the input vectors as rows and a column vector y that contains the dependent values (the targets). Now we want to find a set of coefficients theta such that

  (X * theta - y)^2 --> min

Differentiating this w.r.t. theta and setting the result equal to zero, one gets

  theta = inv(X' * X) * X' * y

If one has X and y, Matlab's backslash operator does exactly that (therefore in Matlab one can write theta = X \ y). So that gives us the optimal coefficients for that problem w.r.t. the squared error.

In order to transform this into a task suitable for map-reduce, the equation is split into theta = inv(A) * b, where A = X' * X and b = X' * y. X can be very large: its number of rows is the number of training samples, its number of columns is the input dimensionality. Compared to X, A is usually small, with the number of rows and columns each equal to the input dimensionality.

The mappers determine partial sums of matrix A and vector b using a subset of X and y (a number of rows); let's call them A_i and b_i. The reducer then sums all matrices A_i to the final A and does the same with the b_i. It finally computes theta = inv(A) * b.

In the case of linear regression, theta can be used to make multiple predictions (as X * theta), which could be done in parallel using mappers, passing theta via the distributed cache for instance. In the case of locally weighted regression, however, the prediction input already influences the determination of theta.

Suppose we're looking for the output y_query at some x_query and have examples x_0,...,x_N --> y_0,...,y_N.
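In case a concrete picture helps, the decomposition just described can be sketched in plain Java. This is only an illustration (no Hadoop or Colt here; the class and method names are mine): one accumulate() call per "mapper" block of rows, the shared arrays playing the role of the reducer's sums, and a small Gaussian-elimination solve instead of forming inv(A) explicitly:

```java
// Sketch of the map-reduce decomposition of the normal equations:
// theta = inv(X' * X) * X' * y, with A = X' * X and b = X' * y.
// Each "mapper" sees a block of rows of X (and the matching entries of y)
// and emits partial sums A_i and b_i; the "reducer" adds them up and solves.
public class NormalEquationsSketch {

    // Mapper side: accumulate A_i = X_i' * X_i and b_i = X_i' * y_i
    // over a subset of the rows, added into a and b.
    static void accumulate(double[][] xRows, double[] yRows,
                           double[][] a, double[] b) {
        int d = a.length;
        for (int r = 0; r < xRows.length; r++) {
            for (int i = 0; i < d; i++) {
                b[i] += xRows[r][i] * yRows[r];
                for (int j = 0; j < d; j++) {
                    a[i][j] += xRows[r][i] * xRows[r][j];
                }
            }
        }
    }

    // Reducer side: solve A * theta = b by Gaussian elimination with
    // partial pivoting, rather than inverting A literally.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int p = 0; p < n; p++) {
            int max = p;
            for (int i = p + 1; i < n; i++)
                if (Math.abs(a[i][p]) > Math.abs(a[max][p])) max = i;
            double[] tmpRow = a[p]; a[p] = a[max]; a[max] = tmpRow;
            double tmp = b[p]; b[p] = b[max]; b[max] = tmp;
            for (int i = p + 1; i < n; i++) {
                double f = a[i][p] / a[p][p];
                b[i] -= f * b[p];
                for (int j = p; j < n; j++) a[i][j] -= f * a[p][j];
            }
        }
        double[] theta = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = b[i];
            for (int j = i + 1; j < n; j++) sum -= a[i][j] * theta[j];
            theta[i] = sum / a[i][i];
        }
        return theta;
    }

    public static void main(String[] args) {
        // Toy data with y = 2*x0 + 3*x1, split across two "mappers".
        double[][] x1 = {{1, 0}, {0, 1}};
        double[] y1 = {2, 3};
        double[][] x2 = {{1, 1}, {2, 1}};
        double[] y2 = {5, 7};

        double[][] a = new double[2][2];
        double[] b = new double[2];
        accumulate(x1, y1, a, b);   // partial sums from mapper 1
        accumulate(x2, y2, a, b);   // partial sums from mapper 2 ("reduce" = adding)

        double[] theta = solve(a, b);
        System.out.println(theta[0] + " " + theta[1]);  // prints 2.0 3.0
    }
}
```

The point being: only the small A_i and b_i travel to the reducer, never the rows of X themselves.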
Again, the x_i go into a matrix X and the y_i into a column vector y. Additionally, each row of X and each entry of y is multiplied by the result of a function weight(x_i, x_query). Usually we want examples close to x_query to get a higher weight than those further away. If we formulate this with matrices again, we get

  theta = inv(X' * W * X) * X' * W * y

where W is a diagonal matrix containing the weight of each example, i.e., W(i, i) = weight(x_i, x_query).

The map-reduce formulation is almost identical, except that the mappers now first calculate weight(x_i, x_query) and then use this to determine their A_i and b_i. The task of the reducer is unchanged. The resulting theta is used to make only one prediction, namely y_query = x_query * theta.

I hope it now becomes clear why I need the prediction input in the mappers. Unless one has millions of training examples, using map-reduce here probably wouldn't make sense just to make one prediction.

>> shouldn't do the inversion literally. I'm now using Colt's Algebra.solve
>> as t = Algebra.solve(A, b). [...]
> Is that code using LU decomposition or QR to do the solving?

I think in my case it uses LU, but I would have to look that up. I'm not at home right now where the computer with the code is.

[kernels]
>> other algorithms will make use of kernels as well.
> That sounds useful, but for now the LWLR package is a fine place for that.

Alright.

>> Moreover, I had to enable reading/writing of matrices using sequence
>> files. I think I will make a separate patch for that.
> Isn't there a MatrixWritable class for this?

Yeah, but it actually doesn't do anything. There are just commented-out matrix.write(out) and matrix.load(in) statements. I replaced those with something similar to the code of VectorWritable.
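Coming back to the locally weighted part for a moment: the whole difference to the plain case is one extra multiplication per row on the mapper side. A minimal standalone sketch of one prediction (everything here is illustrative — the Gaussian kernel is just one common choice of weighting function, not something the code prescribes, and the 2x2 solve is done by Cramer's rule to keep it short):

```java
// Sketch of locally weighted linear regression for a single query point:
// mappers would weight their rows by weight(x_i, x_query) before
// accumulating A_i and b_i, so theta depends on the query.
public class LwlrSketch {

    // One common weighting function (an assumption here, not the only
    // option): a Gaussian kernel on the distance to x_query.
    static double weight(double[] x, double[] xQuery, double tau) {
        double d2 = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - xQuery[i];
            d2 += diff * diff;
        }
        return Math.exp(-d2 / (2 * tau * tau));
    }

    // Accumulate A = X' * W * X and b = X' * W * y (the "mapper" work,
    // done here in one loop), then solve the 2x2 system A * theta = b by
    // Cramer's rule (the "reducer" work) and return x_query * theta.
    static double predict(double[][] x, double[] y,
                          double[] xQuery, double tau) {
        double[][] a = new double[2][2];
        double[] b = new double[2];
        for (int r = 0; r < x.length; r++) {
            double w = weight(x[r], xQuery, tau);
            for (int i = 0; i < 2; i++) {
                b[i] += w * x[r][i] * y[r];
                for (int j = 0; j < 2; j++) {
                    a[i][j] += w * x[r][i] * x[r][j];
                }
            }
        }
        double det = a[0][0] * a[1][1] - a[0][1] * a[1][0];
        double t0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det;
        double t1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det;
        return xQuery[0] * t0 + xQuery[1] * t1;
    }

    public static void main(String[] args) {
        // Rows are [1, x] so the local model has an intercept; y = x^2 is
        // nonlinear globally but nearly linear close to any query point.
        double[][] x = {{1, 0}, {1, 1}, {1, 2}, {1, 3}, {1, 4}};
        double[] y = {0, 1, 4, 9, 16};
        double yQuery = predict(x, y, new double[]{1, 2}, 0.3);
        System.out.println(yQuery);  // close to the true value 4 at x = 2
    }
}
```

Note that weight() needs x_query, which is exactly why the prediction input has to reach the mappers.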
I think the idea there was to let the actually used matrix implementation take care of the details, but that wasn't done for Vectors, and I think it's better to have the details about input/output stream handling in the *Writable classes. Otherwise we would create dependencies on the stream classes in the Vector/Matrix classes. Also, there's no test for MatrixWritable. I think tonight or tomorrow I can submit a patch containing both a MatrixWritable that actually writes and reads and a test for that class.

Cheers,
Alex
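P.S. In case it helps to picture what I mean: a rough standalone sketch of the write/read logic (not the actual patch — it ignores Mahout's Matrix interface and sparse representations, and just uses raw arrays) in the spirit of how VectorWritable handles vectors, i.e. dimensions first, then the entries:

```java
import java.io.*;

// Illustrative round-trip of a dense matrix through DataOutput/DataInput,
// the stream interfaces a Writable's write()/readFields() would see.
public class MatrixIoSketch {

    static void write(DataOutput out, double[][] m) throws IOException {
        out.writeInt(m.length);        // number of rows
        out.writeInt(m[0].length);     // number of columns
        for (double[] row : m)
            for (double v : row)
                out.writeDouble(v);    // entries row by row
    }

    static double[][] read(DataInput in) throws IOException {
        int rows = in.readInt();
        int cols = in.readInt();
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                m[i][j] = in.readDouble();
        return m;
    }

    public static void main(String[] args) throws IOException {
        double[][] m = {{1.5, 2.0}, {3.0, 4.5}};
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(new DataOutputStream(bytes), m);
        double[][] back = read(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(back[1][1]);  // prints 4.5
    }
}
```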
