> The issue with this kind of program is that there is usually a part of the
> data that is bounded in size
> (the model) and a part of the data that is unbounded in size (the input
> data).  The bounded portion is
> usually what is stored in distributed cache, even if not all mappers read
> all of that data.
>
> The unbounded part is normally then parceled out to the different data
> nodes.
>
> Why is your program the opposite?  Need it be?  Should it be?

It isn't the opposite; the input data is distributed. I'm not sure what
you mean by model here. If it's the model to be estimated (for linear
regression, the coefficients), then for the locally weighted version this
doesn't apply: a different problem has to be solved for each prediction
input. Maybe I should explain in detail what's happening:

For linear regression, we have a matrix X that contains the input vectors
as rows and a column vector y that contains the dependent value (the
target). Now we want to find a set of coefficients theta such that

||X * theta - y||^2 --> min

Differentiating this w.r.t. theta and setting the result equal to zero,
one gets

theta = inv(X' * X) * X' * y

Given X and y, Matlab's backslash operator computes exactly this solution
(in Matlab one can simply write theta = X \ y). That gives us the optimal
coefficients for the problem w.r.t. the squared error.
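To make that concrete, here is a rough standalone sketch (plain Java
arrays, hypothetical names, and a 2x2 case only, so inv(X' * X) can be
written out analytically -- this is not the Mahout/Colt code):

```java
// Minimal sketch of the normal equations for ordinary least squares.
// Class and method names are hypothetical; X has exactly two columns
// (an intercept column of ones plus one feature) so that A = X' * X
// is 2x2 and can be inverted in closed form.
public class NormalEquations {

    // Returns theta = inv(X' * X) * X' * y.
    public static double[] solve(double[][] X, double[] y) {
        double[][] A = new double[2][2]; // A = X' * X
        double[] b = new double[2];      // b = X' * y
        for (int i = 0; i < X.length; i++) {
            for (int r = 0; r < 2; r++) {
                for (int c = 0; c < 2; c++) {
                    A[r][c] += X[i][r] * X[i][c];
                }
                b[r] += X[i][r] * y[i];
            }
        }
        // Analytic inverse of the 2x2 matrix A, applied to b.
        double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
        return new double[] {
            ( A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (-A[1][0] * b[0] + A[0][0] * b[1]) / det
        };
    }

    public static void main(String[] args) {
        // Points on the line y = 1 + 2x; the solver recovers theta = [1, 2].
        double[][] X = {{1, 0}, {1, 1}, {1, 2}};
        double[] y = {1, 3, 5};
        double[] theta = solve(X, y);
        System.out.println(theta[0] + " " + theta[1]);
    }
}
```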

In order to transform this into a task suitable for map-reduce, the
equation is split into theta = inv(A) * b, where A = X' * X and b = X' *
y. X can be very large: its number of rows is the number of training
samples and its number of columns is the input dimensionality. Compared
to X, A is usually small, with the number of rows and columns each equal
to the input dimensionality.

The mappers determine partial sums of matrix A and vector b using a subset
of X and y (a number of rows); let's call them A_i and b_i. The reducer
then sums all the A_i to the final A, does the same with the b_i, and
finally computes theta = inv(A) * b. In the case of linear regression,
theta can be used to make multiple predictions (as X * theta), which could
be done in parallel using mappers, passing theta via the distributed
cache, for instance.
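As a standalone sketch (again plain Java arrays and hypothetical names,
not the actual mapper/reducer classes), the partial-sum idea is:

```java
// Sketch of the partial-sum formulation: each mapper computes
// A_i = X_i' * X_i and b_i = X_i' * y_i over its chunk of rows, and the
// reducer just adds the partial results element-wise. d is the input
// dimensionality; names are hypothetical.
public class PartialSums {

    // One mapper's contribution: A_i (d x d) with b_i (d) appended as the
    // last row, packed into a single (d+1) x d array for simplicity.
    public static double[][] mapChunk(double[][] X, double[] y) {
        int d = X[0].length;
        double[][] out = new double[d + 1][d];
        for (int i = 0; i < X.length; i++) {
            for (int r = 0; r < d; r++) {
                for (int c = 0; c < d; c++) {
                    out[r][c] += X[i][r] * X[i][c]; // A_i += x_i * x_i'
                }
                out[d][r] += X[i][r] * y[i];        // b_i += x_i * y_i
            }
        }
        return out;
    }

    // The reducer: element-wise sum of all partial results.
    public static double[][] reduce(double[][]... partials) {
        int rows = partials[0].length, cols = partials[0][0].length;
        double[][] sum = new double[rows][cols];
        for (double[][] p : partials) {
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    sum[r][c] += p[r][c];
                }
            }
        }
        return sum;
    }
}
```

Summing the chunk results gives exactly the A and b one would get from a
single pass over all of X and y, which is why the reducer's job is so
simple.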

In the case of locally weighted regression, however, the prediction input
already influences the determination of theta. Suppose we're looking for
the output y_query at some x_query and have examples x_0,...,x_N -->
y_0,...,y_N. Again, the x_i go into a matrix X and the y_i into a column
vector y. Additionally, each example's contribution is weighted by a
function weight(x_i, x_query) (equivalently, each row of X and the
corresponding entry of y are scaled by the square root of that weight).
Usually we want examples close to x_query to get a higher weight than
those further away. If we formulate this as matrices again, we get

theta = inv(X' * W * X) * X' * W * y,

where W is a diagonal matrix containing the weight of each example, i.e.,
W(i, i) = weight(x_i, x_query).

The map-reduce formulation is almost identical, except that now the
mappers first calculate weight(x_i, x_query) and then use this to
determine their A_i and b_i. The task of the reducer is unchanged. The
resulting theta is used to make only one prediction, namely y_query =
x_query * theta.
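Sketched in the same style -- the Gaussian kernel and the bandwidth tau
below are just one common choice of weight function, nothing prescribed
by the patch, and for brevity the partial sums are accumulated in a
single loop rather than split across mappers:

```java
// Sketch of the locally weighted variant: each example's contribution to
// A and b is multiplied by weight(x_i, x_query) before accumulation, so
// theta = inv(X' * W * X) * X' * W * y. One-dimensional inputs with an
// intercept, i.e. rows of X are [1, x_i]; names are hypothetical.
public class WeightedPartialSums {

    // Hypothetical weight function: a Gaussian kernel with bandwidth tau,
    // giving nearby examples higher weight than distant ones.
    static double weight(double x, double xQuery, double tau) {
        double d = x - xQuery;
        return Math.exp(-d * d / (2 * tau * tau));
    }

    public static double[] solve(double[] x, double[] y,
                                 double xQuery, double tau) {
        // A = X' * W * X is 2x2 (a01 = a10), b = X' * W * y.
        double a00 = 0, a01 = 0, a11 = 0, b0 = 0, b1 = 0;
        for (int i = 0; i < x.length; i++) {
            double w = weight(x[i], xQuery, tau);
            a00 += w;                // w * 1 * 1
            a01 += w * x[i];         // w * 1 * x_i
            a11 += w * x[i] * x[i];  // w * x_i * x_i
            b0  += w * y[i];
            b1  += w * x[i] * y[i];
        }
        double det = a00 * a11 - a01 * a01;
        return new double[] {
            ( a11 * b0 - a01 * b1) / det,
            (-a01 * b0 + a00 * b1) / det
        };
    }
}
```

The resulting theta only serves the single query: y_query = theta[0] +
theta[1] * x_query here.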

I hope it is now clear why I need the prediction input in the mappers.
That said, using map-reduce to make just one prediction probably only
makes sense once one has millions of training examples.


>> shouldn't do the inversion literally. I'm now using Colt's Algebra.solve
>> as t = Algebra.solve(A, b).
[...]
> Is that code using LU decomposition or QR to do the solving?

I think in my case it uses LU, but I would have to look that up. I'm not
at home right now, where the computer with the code is.


[kernels]
>> other algorithms will make use of kernels as well.
> That sounds useful, but for now the LWLR package is a fine place for that.

Alright.


>> Moreover, I had to
>> enable reading/writing of matrices using sequence files, I think I will
>> make a separate patch for that.
> Isn't there a MatrixWritable class for this?

Yeah, but it actually doesn't do anything: the matrix.write(out) and
matrix.load(in) statements are just commented out. I replaced those with
something similar to the code of VectorWritable. I think the idea there
was to let the matrix implementation actually in use take care of the
details, but that wasn't done for Vectors, and I think it's better to
keep the details of input/output stream handling in the *Writable
classes. Otherwise we would create dependencies on the stream classes in
the Vector/Matrix classes. Also, there's no test for MatrixWritable.
Tonight or tomorrow I can submit a patch containing both a MatrixWritable
that actually writes and reads and a test for that class.
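For illustration, the kind of read/write logic I mean looks roughly like
this -- plain DataInput/DataOutput as in Hadoop's Writable interface, but
the class below is illustrative, not the actual Mahout MatrixWritable:

```java
import java.io.*;

// Sketch of dense matrix serialization: dimensions first, then the
// entries row-major. A real Writable would wrap this in write(DataOutput)
// and readFields(DataInput) instance methods.
public class DenseMatrixIO {

    public static void write(double[][] m, DataOutput out) throws IOException {
        out.writeInt(m.length);     // number of rows
        out.writeInt(m[0].length);  // number of columns
        for (double[] row : m) {
            for (double v : row) {
                out.writeDouble(v);
            }
        }
    }

    public static double[][] read(DataInput in) throws IOException {
        int rows = in.readInt();
        int cols = in.readInt();
        double[][] m = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                m[r][c] = in.readDouble();
            }
        }
        return m;
    }
}
```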


Cheers,

Alex

