Thanks for the useful answer.
I agree with you about the usefulness of the Mahalanobis distance. My
problem is an outlier detection problem: I have a set of documents and
I want to discard the less informative ones, or those that are "distant"
from the core information of the set.
I'm implementing a search approach that iteratively selects the most
promising documents. Starting from a random seed of documents, it
tries to model the information in the seed and uses this model to find
the next most similar document. At the moment, I'm modelling the
information inside the seed documents by computing their centroid, and
then I compute the cosine similarity between the centroid and all the
other documents. The most similar one is added, the centroid is
recomputed, and so on until the stopping criterion is satisfied.
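To make the loop concrete, here is a minimal sketch of what I have in mind
(the class and method names and the fixed budget are just illustrative; I'm
assuming Mahout's math Vector with dot(), norm(2), plus() and divide()):

import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.math.Vector;

// Illustrative sketch of the greedy selection loop described above.
public class GreedyCosineSelector {

  // cosine similarity = (a . b) / (|a| * |b|)
  static double cosine(Vector a, Vector b) {
    double denom = a.norm(2) * b.norm(2);
    return denom == 0.0 ? 0.0 : a.dot(b) / denom;
  }

  // centroid = element-wise mean of the selected documents
  static Vector centroid(List<Vector> docs) {
    Vector sum = docs.get(0);
    for (int i = 1; i < docs.size(); i++) {
      sum = sum.plus(docs.get(i));
    }
    return sum.divide(docs.size());
  }

  // repeatedly add the candidate most similar to the current centroid;
  // the fixed 'budget' stands in for whatever stopping criterion is used
  static List<Vector> select(List<Vector> seed, List<Vector> candidates, int budget) {
    List<Vector> selected = new ArrayList<Vector>(seed);
    List<Vector> remaining = new ArrayList<Vector>(candidates);
    while (!remaining.isEmpty() && selected.size() < budget) {
      Vector c = centroid(selected);
      int best = 0;
      for (int i = 1; i < remaining.size(); i++) {
        if (cosine(c, remaining.get(i)) > cosine(c, remaining.get(best))) {
          best = i;
        }
      }
      selected.add(remaining.remove(best));  // add the most similar, then recompute the centroid
    }
    return selected;
  }
}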
The problem with the cosine similarity, which if I'm not wrong is
essentially a Euclidean distance, is that it groups the documents inside
a sphere, so I was wondering whether to use the Mahalanobis distance
instead, because it would allow me to group the documents inside an
ellipsoid. This means computing the covariance matrix of the seed documents.
I looked around for other approaches, but at the moment I'm stalled on
the Mahalanobis distance and its covariance matrix.
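Just to fix the notation, what I would like to compute is something like the
sketch below; the inverse covariance (here passed as a plain array just for
illustration) is exactly the part I don't know how to obtain, since with more
terms than documents building and inverting the full matrix is out of reach:

import org.apache.mahout.math.Vector;

// Sketch only: Mahalanobis distance of a document x from the seed centroid mu,
// d(x) = sqrt( (x - mu)^T * Sigma^-1 * (x - mu) ),
// where sigmaInverse is the inverse of the seed covariance matrix.
static double mahalanobis(Vector x, Vector mu, double[][] sigmaInverse) {
  Vector d = x.minus(mu);
  double sum = 0.0;
  for (int i = 0; i < d.size(); i++) {
    for (int j = 0; j < d.size(); j++) {
      sum += d.get(i) * sigmaInverse[i][j] * d.get(j);
    }
  }
  return Math.sqrt(sum);
}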
Any suggestions are very welcome
Thanks a lot
Marco
On 20 Jul 2011, at 19:52, Ted Dunning wrote:
Mahalanobis distance as defined really depends on good estimates of
covariance and that is a pretty dense computation.
Generally, you need to define some restricted kind of covariance or a
strong prior distribution on covariance in order to get a reasonable
estimate in the kind of ill-posed problem that you are suggesting. One
possibility is to assume that covariance is sparse with only a few
non-diagonal terms. If you have some kind of heuristic for deciding
where you will allow non-zeros, then you can build an estimation
technique. Not surprisingly, I recommend LLR for finding interesting
points.
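The extreme version of that restriction is a purely diagonal covariance,
which only needs per-term variances. A rough sketch (the smoothing constant
lambda is an arbitrary stand-in for a prior on the variances, not anything
Mahout provides):

import java.util.List;
import org.apache.mahout.math.Vector;

// Per-term variance of the seed documents around their centroid mu.
static Vector diagonalVariance(List<Vector> seed, Vector mu) {
  Vector sumSq = mu.like();                 // empty vector of the same cardinality
  for (Vector doc : seed) {
    Vector d = doc.minus(mu);
    sumSq = sumSq.plus(d.times(d));         // element-wise square
  }
  return sumSq.divide(seed.size());
}

// Mahalanobis distance with a diagonal covariance: the inverse is just
// 1 / (variance_i + lambda), so no matrix ever has to be built or inverted.
static double diagonalMahalanobis(Vector x, Vector mu, Vector variance, double lambda) {
  double sum = 0.0;
  for (int i = 0; i < x.size(); i++) {
    double d = x.get(i) - mu.get(i);
    sum += d * d / (variance.get(i) + lambda);
  }
  return Math.sqrt(sum);
}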
Another option is to look at rank-limited approximations like the SVD,
which are not sparse, but do require less storage than fully dense
representations.
Both of these can be viewed as decompositions, but this viewpoint
probably only really helps with the SVD approach.
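A sketch of what the rank-limited route might look like, assuming Mahout's
SingularValueDecomposition; the rank k, the row-per-document layout of the
seed matrix, and the use of a dense SVD here are illustrative assumptions:

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SingularValueDecomposition;
import org.apache.mahout.math.Vector;

// Sketch: rank-limited Mahalanobis distance.
// 'centered' holds the seed documents as rows with the centroid mu already
// subtracted. The sample covariance is centered^T * centered / (n - 1)
// = V * diag(s^2 / (n - 1)) * V^T, so its pseudo-inverse only needs the
// top-k singular directions.
static double rankLimitedMahalanobis(Vector x, Vector mu, Matrix centered, int k) {
  // decompose the transpose (terms x docs) so the matrix is tall;
  // JAMA-derived SVD code typically expects rows >= columns
  SingularValueDecomposition svd = new SingularValueDecomposition(centered.transpose());
  Matrix dirs = svd.getU();                  // columns = directions in term space
  double[] s = svd.getSingularValues();
  int n = centered.numRows();
  Vector d = x.minus(mu);

  double sum = 0.0;
  for (int j = 0; j < k && j < s.length; j++) {
    double z = 0.0;                          // projection of (x - mu) onto direction j
    for (int i = 0; i < d.size(); i++) {
      z += dirs.get(i, j) * d.get(i);
    }
    sum += (n - 1) * z * z / (s[j] * s[j]);
  }
  return Math.sqrt(sum);
}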
If I were working on this problem, though, I would strongly question
the need for Mahalanobis distance in the first place. It can be lovely,
but it really comes from an earlier time and I question the
applicability to large sparse matrices of counts.
On Wed, Jul 20, 2011 at 10:28 AM, Marco Turchi <[email protected]> wrote:
...
I have one last question:
I need to compute the covariance matrix of some input data in order to
compute the Mahalanobis distance. In my data, the number of features is
larger than the number of documents, and the data are very sparse. I
cannot compute the covariance matrix using the classic approach, because
it becomes very expensive in time and computation. Is there any
implementation of an approximate way to compute it in Mahout? I have had
a look in the library, but I cannot find one.
thanks for your help
Marco
On 20 Jul 2011, at 19:15, Ted Dunning wrote:
Well, that is hard to understand without a complete test case. In the
first case you have a vector with one element constructed, while in the
second case you wind up with two elements and don't show the constructor.
Write up a JIRA and attach a unit test.
My strong expectation is that this is a case of refrigerator blindness.
On Wed, Jul 20, 2011 at 9:22 AM, Marco Turchi <[email protected]> wrote:
Hi
you are right about the sparse vector issue... but I'm constructing
distV in the same way, changing only the sign (+ or -) of the variable
mean. In both cases, I should have the same number of entries in the
final vector.
Thanks a lot for your help
Marco
On 20 Jul 2011, at 17:42, Ted Dunning wrote:
You constructed the first vector with a dimension of 1. It looks like
you constructed the second one with a larger dimension of 2.
When you offset a sparse vector, all of the zeros become non-zero and
the vector becomes dense. This results in a bunch of cells being created.
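For illustration only, the same effect on a slightly bigger vector (the
printed values are indicative, not taken from an actual run):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// a size-5 sparse vector with a single explicit entry
RandomAccessSparseVector sparse = new RandomAccessSparseVector(5);
sparse.setQuick(2, 3.0);

// plus(1.0) shifts every cell, so the implicit zeros all become 1.0
Vector shifted = sparse.plus(1.0);
System.out.println(sparse);    // something like {2:3.0}
System.out.println(shifted);   // something like {0:1.0,1:1.0,2:4.0,3:1.0,4:1.0}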
On Wed, Jul 20, 2011 at 6:28 AM, marco turchi <[email protected]> wrote:
Dear All,
I am seeing a strange behaviour when I use the plus method of Vector.
I have a RandomAccessSparseVector: if I add a positive number, I get a
new vector where each element is the sum of the old value and the
positive number, while if I add a negative number, the new vector has
one more entry:
RandomAccessSparseVector distV = new RandomAccessSparseVector(1);
distV.setQuick(0,1);
double mean = 1;
RandomAccessSparseVector app = (RandomAccessSparseVector) distV.plus(mean);
the output is
{0:2.0}
if I have
double mean = -1;
RandomAccessSparseVector app = (RandomAccessSparseVector) distV.plus(mean);
the output is
{1:1.0,0:-1.0}
For sure I'm doing something wrong. Do you have any ideas where the
problem is?
Thanks a lot in advance
Marco