Thanks for the useful answer.

I agree with you about the usefulness of the Mahalanobis distance. My problem is an outlier detection problem: I have a set of documents and I want to discard the less informative ones, or those that are "distant" from the core information of the set.

I'm implementing a search approach that iteratively selects the most promising documents. Starting from a random seed of documents, it tries to model the information in the seed and uses this model to find the next most similar document. At the moment I model the information inside the seed documents by computing their centroid, and then I compute the cosine similarity between the centroid and all the other documents. The most similar document is added, the centroid is recomputed, and so on until the stopping criterion is satisfied.
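
For concreteness, here is a minimal sketch of that loop in Mahout terms, assuming each document is already a TF-IDF Vector of the same cardinality; class and method names such as GreedySelector and selectNext are purely illustrative, not existing Mahout API:

import java.util.List;
import org.apache.mahout.math.Vector;

public class GreedySelector {

  // Centroid of the current seed: element-wise mean of the seed vectors.
  static Vector centroid(List<Vector> seed) {
    Vector sum = seed.get(0).like();      // empty vector, same cardinality
    for (Vector doc : seed) {
      sum = sum.plus(doc);
    }
    return sum.divide(seed.size());
  }

  // Cosine similarity = dot(a, b) / (|a| * |b|)
  static double cosine(Vector a, Vector b) {
    double denom = a.norm(2) * b.norm(2);
    return denom == 0 ? 0 : a.dot(b) / denom;
  }

  // Pick the remaining document most similar to the current centroid.
  static Vector selectNext(List<Vector> seed, List<Vector> candidates) {
    Vector c = centroid(seed);
    Vector best = null;
    double bestSim = Double.NEGATIVE_INFINITY;
    for (Vector doc : candidates) {
      double sim = cosine(c, doc);
      if (sim > bestSim) {
        bestSim = sim;
        best = doc;
      }
    }
    return best;
  }
}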

The problem with the cosine similarity, which if I'm not wrong behaves like a Euclidean distance, is that it groups the documents inside a sphere, so I was wondering about using the Mahalanobis distance instead, because it allows me to group the documents inside an ellipsoid. This means computing the covariance matrix of the seed documents. I looked around for other approaches, but at the moment I'm stuck on the Mahalanobis distance and its covariance matrix.
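
For reference, the Mahalanobis distance being discussed is d(x, mu) = sqrt((x - mu)^T S^-1 (x - mu)), where mu would be the seed centroid and S the seed covariance. A minimal sketch in Mahout terms, assuming the inverse covariance were already available (estimating it is exactly the open problem in this thread):

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// Sketch only: assumes inverseCovariance (S^-1) has been estimated elsewhere.
static double mahalanobis(Vector x, Vector mu, Matrix inverseCovariance) {
  Vector d = x.minus(mu);                                // x - mu
  return Math.sqrt(d.dot(inverseCovariance.times(d)));   // (x-mu)^T S^-1 (x-mu)
}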
Any suggestions are very welcome.

Thanks a lot
Marco



On 20 Jul 2011, at 19:52, Ted Dunning wrote:

Mahalanobis distance as defined really depends on good estimates of
covariance and that is a pretty dense computation.

Generally, you need to define some restricted kind of covariance or a strong prior distribution on covariance in order to get a reasonable estimate in the kind of ill-posed problem that you are suggesting. One possibility is to assume that covariance is sparse with only a few non-diagonal terms. If you have some kind of heuristic for deciding where you will allow non-zeros, then you can build an estimation technique. Not surprisingly, I recommend LLR for finding interesting points.
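
One hedged reading of that suggestion, as a sketch: estimate covariance only on the diagonal plus a small whitelist of (i, j) feature pairs, with the whitelist chosen by some heuristic such as an LLR-style co-occurrence test. All names below are illustrative; none of this is existing Mahout API.

import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.mahout.math.Vector;

// Covariance is only computed and stored for an allowed set of (i, j) pairs;
// every other off-diagonal entry is treated as zero.
static Map<String, Double> restrictedCovariance(List<Vector> docs,
                                                Vector mean,
                                                Collection<int[]> allowedPairs) {
  int n = docs.size();
  Map<String, Double> cov = new HashMap<String, Double>();
  for (int[] pair : allowedPairs) {
    int i = pair[0];
    int j = pair[1];
    double sum = 0;
    for (Vector doc : docs) {
      sum += (doc.getQuick(i) - mean.getQuick(i))
           * (doc.getQuick(j) - mean.getQuick(j));
    }
    cov.put(i + "," + j, sum / (n - 1));
  }
  return cov;
}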

Another option is to look at rank-limited approximations like the SVD, which are not sparse, but do require less storage than fully dense representations.

Both of these can be viewed as decompositions, but this approach probably
only really helps with the SVD approach.
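
As an illustration of the rank-limited idea (my own sketch, not anything in Mahout): if the centered seed matrix has a rank-k SVD with right singular vectors V (d x k) and singular values sigma_i, then in the projected k-dimensional basis the covariance is diagonal with entries sigma_i^2 / (n - 1), and the Mahalanobis distance reduces to a scaled Euclidean distance.

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// V (d x k) and singularValues are assumed to come from an SVD run separately.
static double lowRankMahalanobis(Vector x, Vector mean,
                                 Matrix v, double[] singularValues, int n) {
  Vector z = v.transpose().times(x.minus(mean));  // project into the k-dim basis
  double sum = 0;
  for (int i = 0; i < singularValues.length; i++) {
    double variance = singularValues[i] * singularValues[i] / (n - 1);
    if (variance > 0) {
      sum += z.getQuick(i) * z.getQuick(i) / variance;
    }
  }
  return Math.sqrt(sum);
}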

If I were working on this problem, though, I would strongly question the need for Mahalanobis distance in the first place. It can be lovely, but it really comes from an earlier time and I question the applicability to large
sparse matrices of counts.

On Wed, Jul 20, 2011 at 10:28 AM, Marco Turchi <[email protected]> wrote:

...
I have one last question:
I need to compute the covariance matrix of some input data in order to compute the Mahalanobis distance. In my data, the number of features is bigger than the number of docs, and the data are very sparse. I cannot compute the covariance matrix using the classic approach, because it becomes very expensive in both time and computation. Is there any implementation of an approximate way to compute it in Mahout? I have had a look in the library, but I could not find one.

thanks for your help
Marco


On 20 Jul 2011, at 19:15, Ted Dunning wrote:

Well, that is hard to understand without a complete test case. In the first case you have a vector with one element constructed, while in the second case you wind up with two elements and don't show the constructor.

Write up a JIRA and attach a unit test.

My strong expectation is that this is a case of refrigerator blindness.

On Wed, Jul 20, 2011 at 9:22 AM, Marco Turchi <[email protected]> wrote:

Hi
you are right about the sparse vector issue... but I'm constructing distV in the same way, changing only the sign (+ or -) of the variable mean. In both cases, I should have the same number of entries in the final vector.

Thanks a lot for your help
Marco


On 20 Jul 2011, at 17:42, Ted Dunning wrote:

You constructed the first vector with a dimension of 1. It looks like you constructed the second one with a larger dimension of 2.

When you offset a sparse vector, all of the zeros become non-zero and the vector becomes dense. This results in a bunch of cells being created.
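
A tiny illustration of that effect (the printed count is my expectation, not verified output):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Adding a scalar offset touches every position, so a vector with one stored
// cell ends up with one stored cell per position.
Vector sparse = new RandomAccessSparseVector(3);
sparse.setQuick(1, 5.0);                                  // {1:5.0}
Vector shifted = sparse.plus(-1.0);                       // {0:-1.0, 1:4.0, 2:-1.0}
System.out.println(shifted.getNumNondefaultElements());   // expect 3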

On Wed, Jul 20, 2011 at 6:28 AM, marco turchi <[email protected]> wrote:


Dear All,

I am seeing some strange behaviour when I use the plus method on a Vector.

I have a RandomAccessSparseVector. If I add a positive number, I get a new Vector where each element is the sum of the old value plus that positive number. But if I add a negative number, the new vector has one more entry:


RandomAccessSparseVector distV = new RandomAccessSparseVector(1);
distV.setQuick(0,1);
double mean = 1;
RandomAccessSparseVector app =
    (RandomAccessSparseVector) distV.plus(mean);

the output is
{0:2.0}

if I have
double mean = -1;
RandomAccessSparseVector app =
    (RandomAccessSparseVector) distV.plus(mean);

the output is
{1:1.0,0:-1.0}

For sure I'm doing something wrong. Do you have any idea where the problem is?

Thanks a lot in advance
Marco





