Thanks for the useful answer.
I agree with you about the usefulness of the Mahalanobis distance. My
problem is an outlier detection problem: I have a set of documents and
I want to discard the less informative ones, or those that are "distant"
from the core information of the set.
I'm implementing a search approach that iteratively selects the most
promising documents. Starting from a random seed of documents, it
tries to model the information in the seed and uses this model to find
the next most similar document. At the moment, I'm modelling the
information inside the seed documents by computing their centroid, and
then I compute the cosine similarity between the centroid and all the
other documents. The most similar one is added, the centroid is
recomputed, and so on until the stopping criterion is satisfied.
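To make the loop concrete, here is a minimal sketch of what I have in mind
(the class and method names and the fixed budget are just illustrative; I'm
assuming Mahout's math Vector with dot(), norm(2), plus() and divide()):

import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.math.Vector;

// Illustrative sketch of the greedy selection loop described above.
public class GreedyCosineSelector {

  // cosine similarity = (a . b) / (|a| * |b|)
  static double cosine(Vector a, Vector b) {
    double denom = a.norm(2) * b.norm(2);
    return denom == 0.0 ? 0.0 : a.dot(b) / denom;
  }

  // centroid = element-wise mean of the selected documents
  static Vector centroid(List<Vector> docs) {
    Vector sum = docs.get(0);
    for (int i = 1; i < docs.size(); i++) {
      sum = sum.plus(docs.get(i));
    }
    return sum.divide(docs.size());
  }

  // repeatedly add the candidate most similar to the current centroid;
  // the fixed 'budget' stands in for whatever stopping criterion is used
  static List<Vector> select(List<Vector> seed, List<Vector> candidates, int budget) {
    List<Vector> selected = new ArrayList<Vector>(seed);
    List<Vector> remaining = new ArrayList<Vector>(candidates);
    while (!remaining.isEmpty() && selected.size() < budget) {
      Vector c = centroid(selected);
      int best = 0;
      for (int i = 1; i < remaining.size(); i++) {
        if (cosine(c, remaining.get(i)) > cosine(c, remaining.get(best))) {
          best = i;
        }
      }
      selected.add(remaining.remove(best));  // add the most similar, then recompute the centroid
    }
    return selected;
  }
}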
The problem with the cosine similarity, which if I'm not wrong is
essentially a Euclidean distance, is that it groups the documents inside
a sphere, so I was wondering whether to use the Mahalanobis distance
instead, because it would allow me to group the documents inside an
ellipsoid. This means computing the covariance matrix of the seed documents.
I looked around for other approaches, but at the moment I'm stalled on
the Mahalanobis distance and its covariance matrix.
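Just to fix the notation, what I would like to compute is something like the
sketch below; the inverse covariance (here passed as a plain array just for
illustration) is exactly the part I don't know how to obtain, since with more
terms than documents building and inverting the full matrix is out of reach:

import org.apache.mahout.math.Vector;

// Sketch only: Mahalanobis distance of a document x from the seed centroid mu,
// d(x) = sqrt( (x - mu)^T * Sigma^-1 * (x - mu) ),
// where sigmaInverse is the inverse of the seed covariance matrix.
static double mahalanobis(Vector x, Vector mu, double[][] sigmaInverse) {
  Vector d = x.minus(mu);
  double sum = 0.0;
  for (int i = 0; i < d.size(); i++) {
    for (int j = 0; j < d.size(); j++) {
      sum += d.get(i) * sigmaInverse[i][j] * d.get(j);
    }
  }
  return Math.sqrt(sum);
}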
Any suggestions are very welcome
Thanks a lot
Marco
On 20 Jul 2011, at 19:52, Ted Dunning wrote:
Mahalanobis distance as defined really depends on good estimates of
covariance and that is a pretty dense computation.
Generally, you need to define some restricted kind of covariance or a
strong prior distribution on covariance in order to get a reasonable
estimate in the kind of ill-posed problem that you are suggesting. One
possibility is to assume that covariance is sparse with only a few
non-diagonal terms. If you have some kind of heuristic for deciding
where you will allow non-zeros, then you can build an estimation
technique. Not surprisingly, I recommend LLR for finding interesting
points.
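The extreme version of that restriction is a purely diagonal covariance,
which only needs per-term variances. A rough sketch (the smoothing constant
lambda is an arbitrary stand-in for a prior on the variances, not anything
Mahout provides):

import java.util.List;
import org.apache.mahout.math.Vector;

// Per-term variance of the seed documents around their centroid mu.
static Vector diagonalVariance(List<Vector> seed, Vector mu) {
  Vector sumSq = mu.like();                 // empty vector of the same cardinality
  for (Vector doc : seed) {
    Vector d = doc.minus(mu);
    sumSq = sumSq.plus(d.times(d));         // element-wise square
  }
  return sumSq.divide(seed.size());
}

// Mahalanobis distance with a diagonal covariance: the inverse is just
// 1 / (variance_i + lambda), so no matrix ever has to be built or inverted.
static double diagonalMahalanobis(Vector x, Vector mu, Vector variance, double lambda) {
  double sum = 0.0;
  for (int i = 0; i < x.size(); i++) {
    double d = x.get(i) - mu.get(i);
    sum += d * d / (variance.get(i) + lambda);
  }
  return Math.sqrt(sum);
}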
Another option is to look at rank-limited approximations like the SVD,
which are not sparse, but do require less storage than fully dense
representations.
Both of these can be viewed as decompositions, but this viewpoint
probably only really helps with the SVD approach.
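A sketch of what the rank-limited route might look like, assuming Mahout's
SingularValueDecomposition; the rank k, the row-per-document layout of the
seed matrix, and the use of a dense SVD here are illustrative assumptions:

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SingularValueDecomposition;
import org.apache.mahout.math.Vector;

// Sketch: rank-limited Mahalanobis distance.
// 'centered' holds the seed documents as rows with the centroid mu already
// subtracted. The sample covariance is centered^T * centered / (n - 1)
// = V * diag(s^2 / (n - 1)) * V^T, so its pseudo-inverse only needs the
// top-k singular directions.
static double rankLimitedMahalanobis(Vector x, Vector mu, Matrix centered, int k) {
  // decompose the transpose (terms x docs) so the matrix is tall;
  // JAMA-derived SVD code typically expects rows >= columns
  SingularValueDecomposition svd = new SingularValueDecomposition(centered.transpose());
  Matrix dirs = svd.getU();                  // columns = directions in term space
  double[] s = svd.getSingularValues();
  int n = centered.numRows();
  Vector d = x.minus(mu);

  double sum = 0.0;
  for (int j = 0; j < k && j < s.length; j++) {
    double z = 0.0;                          // projection of (x - mu) onto direction j
    for (int i = 0; i < d.size(); i++) {
      z += dirs.get(i, j) * d.get(i);
    }
    sum += (n - 1) * z * z / (s[j] * s[j]);
  }
  return Math.sqrt(sum);
}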
If I were working on this problem, though, I would strongly question
the need for Mahalanobis distance in the first place. It can be lovely,
but it really comes from an earlier time and I question the
applicability to large sparse matrices of counts.
On Wed, Jul 20, 2011 at 10:28 AM, Marco Turchi <[email protected]> wrote:
...
I have one last question:
I need to compute the covariance matrix of some input data in order to
compute the Mahalanobis distance. In my data, the number of features is
larger than the number of documents, and the data are very sparse. I
cannot compute the covariance matrix using the classic approach, because
it becomes very expensive in time and computation. Is there any
implementation of an approximate way to compute it in Mahout? I have had
a look in the library, but I cannot find one.
thanks for your help
Marco
On 20 Jul 2011, at 19:15, Ted Dunning wrote:
Well, that is hard to understand without a complete test case. In the
first case you have a vector with one element constructed, while in the
second case you wind up with two elements and don't show the constructor.
Write up a JIRA and attach a unit test.
My strong expectation is that this is a case of refrigerator blindness.
On Wed, Jul 20, 2011 at 9:22 AM, Marco Turchi <[email protected]> wrote:
Hi
you are right about the sparse vector issue... but I'm constructing
distV in the same way, changing only the sign (+ or -) of the variable
mean. In both cases, I should have the same number of entries in the
final vector.
Thanks a lot for your help
Marco
On 20 Jul 2011, at 17:42, Ted Dunning wrote:
You constructed the first vector with a dimension of 1. It looks like
you constructed the second one with a larger dimension of 2.
When you offset a sparse vector, all of the zeros become non-zero and
the vector becomes dense. This results in a bunch of cells being created.
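For illustration only, the same effect on a slightly bigger vector (the
printed values are indicative, not taken from an actual run):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// a size-5 sparse vector with a single explicit entry
RandomAccessSparseVector sparse = new RandomAccessSparseVector(5);
sparse.setQuick(2, 3.0);

// plus(1.0) shifts every cell, so the implicit zeros all become 1.0
Vector shifted = sparse.plus(1.0);
System.out.println(sparse);    // something like {2:3.0}
System.out.println(shifted);   // something like {0:1.0,1:1.0,2:4.0,3:1.0,4:1.0}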
On Wed, Jul 20, 2011 at 6:28 AM, marco turchi <[email protected]> wrote:
Dear All,
I am seeing a strange behaviour when I use the plus method of Vector.
I have a RandomAccessSparseVector: if I add a positive number, I get a
new vector where each element is the sum of the old value and the
positive number, while if I add a negative number, the new vector has
one more entry:
RandomAccessSparseVector distV = new RandomAccessSparseVector(1);
distV.setQuick(0,1);
double mean = 1;
RandomAccessSparseVector app = (RandomAccessSparseVector) distV.plus(mean);
the output is
{0:2.0}
if I have
double mean = -1;
RandomAccessSparseVector app = (RandomAccessSparseVector) distV.plus(mean);
the output is
{1:1.0,0:-1.0}
For sure I'm doing something wrong. Do you have any ideas where the
problem is?
Thanks a lot in advance
Marco