Actually, this is probably done more easily using a simple matrix multiplication. The reason for not using recommendation code for this is that your problem is entirely dense.
How exactly you should go about this is a different question. Up to tens of thousands of stars, you can probably do this on a single machine using pretty standard tools like R or MATLAB. For larger problems, you will need to parallelize the computation.

Essentially, if A contains your data, this turns into computing either A A' (if stars are rows) or A' A (if stars are columns). The real problem is that your output is going to be as big as the number of stars, squared. This will probably limit the feasibility of this computation. A million stars will result in something like 10TB of output.

Assuming you have a million stars and each spectrum contains a few thousand observations, the way I would go about this computation would be to store each spectrum as a row and divide your data file into batches of rows. Call the full matrix A and each batch of rows A_1 ... A_n. Each batch should have however many rows it takes to get a matrix product A_i A_j' to take 30-100 seconds. Now, all you have to do is schedule the multiplication of every pair of A_i and A_j. How you do that and how you store the data won't matter very much because the computation costs will outweigh the scheduling and I/O costs.

The output will consist of matrices B_ij that each contain the dot products between all of the stars in A_i and all of the stars in A_j. To find the dot product of two arbitrary stars, you first have to find which batches they are in, and then you need to find their product in the corresponding B_ij file. You should probably check out some of the efficient math packages for doing the local multiplications.

My guess is that this is very much not what you really want to be doing. It is much more likely that you want to have an efficient nearest neighbor search engine so that you can quickly find the, say, thousand most similar stars given any query star. That can be done with packages like FLANN [1] or others [2]. Mahout will not help you with this given the dense nature of your data.
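To make the batching scheme above concrete, here is a minimal sketch in plain Python. It is only an illustration under some assumptions of mine: the function names are hypothetical, the pure-Python block_product stands in for whatever efficient math package you end up using, and the blocks live in a dict rather than in B_ij files. I also normalize each spectrum to unit length first, so that the dot products are cosine similarities, and I only compute blocks with i <= j because B_ji is just the transpose of B_ij.

```python
import math
from itertools import combinations_with_replacement


def output_bytes(n_stars, bytes_per_value=8):
    """Size of the full n x n similarity matrix, assuming 8-byte floats."""
    return n_stars * n_stars * bytes_per_value


def normalize(row):
    """Scale a spectrum to unit length so dot products equal cosine similarities.

    Assumes the spectrum is not all zeros.
    """
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]


def split_batches(rows, batch_size):
    """Cut the rows of A into consecutive batches A_1 ... A_n."""
    return [rows[k:k + batch_size] for k in range(0, len(rows), batch_size)]


def block_product(a_i, a_j):
    """B_ij = A_i A_j'. Illustration only; use an optimized library for real data."""
    return [[sum(x * y for x, y in zip(r, s)) for s in a_j] for r in a_i]


def all_block_products(batches):
    """Compute B_ij for every pair of batches with i <= j."""
    pairs = combinations_with_replacement(range(len(batches)), 2)
    return {(i, j): block_product(batches[i], batches[j]) for i, j in pairs}


def lookup(blocks, batch_size, s, t):
    """Similarity of stars s and t: find their batches, then index into B_ij."""
    i, r = divmod(s, batch_size)  # batch index and row offset of star s
    j, c = divmod(t, batch_size)  # batch index and row offset of star t
    if i <= j:
        return blocks[(i, j)][r][c]
    return blocks[(j, i)][c][r]  # B_ij is the transpose of B_ji


# Toy data: five "stars" with two wavelengths each, in batches of two rows.
spectra = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [1.0, 1.0], [2.0, 0.0]]
batch_size = 2
batches = split_batches([normalize(s) for s in spectra], batch_size)
blocks = all_block_products(batches)
```

With this in place, lookup(blocks, batch_size, 0, 4) gives the cosine similarity between the first and last star, and output_bytes(10**6) gives the raw size of the full result for a million stars. In a real run each block_product call would be one scheduled job, and its result would be written to its own file.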
[1] http://www.cs.ubc.ca/research/flann/
[2] https://www.cs.umd.edu/~mount/ANN/

On Wed, May 13, 2015 at 11:15 PM, Jonathan Seale <jonathanpse...@gmail.com> wrote:

> Scientists,
>
> I have an astrophysical application for Mahout that I need help with.
>
> I have 1-dimensional stellar spectra for many, many stars. Each spectrum
> consists of a series of intensity values, one per wavelength of light. I
> need to be able to find the cosine similarity between ALL pairs of stars.
> Seems to me this is simply a user-user similarity problem where I have
> stars instead of users, wavelengths instead of items, and intensities
> instead of ratings/clicks.
>
> But I'm having difficulty using mahout's row similarity package (I'm new to
> this, and these days astronomers code pretty exclusively in python). I know
> that I must have to 1) create a sparse matrix where each row is a star,
> columns are wavelengths, and the values are intensity, and 2) implement row
> similarity. But I'm just not sure how to do it. Anyone have a good resource
> or be willing to help? I could probably offer some compensation to anyone
> that would be willing to provide a little focussed, personalized
> assistance.
>
> Thanks,
> Jonathan
>