Yes, I agree that the dimensionality of my dataset is low; I intended only to
experiment with this data and then apply SSVD to a huge dataset.

I was actually interested in finding out how my original variables contribute
to each principal component, and your reply answered that. Many thanks!

Thanks,
Vijay.



On Sat, Mar 22, 2014 at 3:00 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Vijay,
>
> what Ted said. It doesn't make much sense to reduce from 12 to 7 dimensions,
> because 12 is already a low dimensionality.
>
> But suppose we accept the rationale for reducing 12 dimensions to 7. Your
> original points are rotated into a 7-dimensional PCA space where they retain
> as much of the variance of the original data as possible, i.e. they roughly
> preserve the Euclidean distances between each other, and so are still
> suitable for clustering, regression, or whatever else you want to do with
> them on that basis.
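>
> (In matrix terms, using standard SVD notation with A as the input matrix:
> with --pca true, SSVD factorizes the mean-centered data,
>
>   A - 1*m' ~= U * Sigma * V'
>
> where m is the vector of column means, so each row of U*Sigma is the
> corresponding input row projected onto the top k = 7 principal directions.)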
>
> Your U*Sigma output should have the same keys as the input.
> If you want to analyze the contribution of your original variables to each
> principal component, you need to examine the V output, which in your case
> will be really tiny: 12 x 7.
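>
> (Note your ssvd run below passed -V false, so V was not written; after
> re-running with -V true, you could dump it with, for example,
>
>   bin/mahout vectordump -i /user/cloudera/reduced_dimensions1/V
>
> where row i of the dump then holds the loadings of original variable i on
> the 7 principal components.)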
>
> On Fri, Mar 21, 2014 at 11:12 AM, Vijay B <b.vijay....@gmail.com> wrote:
>
> > Thanks a lot for the reply.
> >
> > To gain an understanding of how SSVD works, I have taken a sample CSV file
> > with 12 columns, and I want to perform dimensionality reduction on it by
> > asking SSVD to give me the 7 most significant columns.
> >
> > A snippet of my input CSV:
> >
> > 22,2,44,36,5,9,2824,2,4,733,285,169
> > 25,1,150,175,3,9,4037,2,18,1822,254,171
> >
> > Here's what I have done.
> > Step 1: Converted the CSV to a sequence file; below is a snippet of the
> > output:
> >
> > Key: 1: Value: 1:{0:22.0,1:2.0,2:44.0,3:36.0,4:5.0,5:9.0,6:2824.0,7:2.0,8:4.0,9:733.0,10:285.0,11:169.0}
> > Key: 2: Value: 2:{0:25.0,1:1.0,2:150.0,3:175.0,4:3.0,5:9.0,6:4037.0,7:2.0,8:18.0,9:1822.0,10:254.0,11:171.0}
> >
> > Step 2: Passed this sequence file as input to the SSVD command; below is
> > the command I used:
> >
> > bin/mahout ssvd -i /user/cloudera/seq-data.seq -o /user/cloudera/reduced_dimensions1 --rank 7 -us true -V false -U false -pca true -ow -t 1
> >
> > I then executed vectordump on the contents of the USigma folder; below is
> > a snippet of the output:
> >
> > {0:191.5917217160858,1:-349.96930149831184,2:-78.21082086351002,3:98.73075808083476,4:-122.89919847376068,5:4.160343860343885,6:1.4336136023933244}
> > {0:1293.9486625354516,1:697.7408635015182,2:24.0653800270275,3:60.79480738654566,4:11.733624175113523,5:6.479815864873287,6:-0.9269136621845396}
> >
> > Please help me interpret the above results in the USigma folder.
> >
> > Thanks,
> > Vijay.
> >
> > On Fri, Mar 21, 2014 at 9:52 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >
> > > Vijay, how many columns do you have in the CSV? That is the number you
> > > will be reducing.
> > >
> > > csv:
> > > 1,22,33,44,55
> > > 13,23,34,45,56
> > >
> > > would be dense vectors:
> > > Key: 1: Value: {0:1,1:22,2:33,3:44,4:55}
> > > Key: 2: Value: {0:13,1:23,2:34,3:45,4:56}
> > >
> > > Unless you have some reason to assign different dimension indexes, the
> > > row and column numbers from your CSV should be used in Mahout. Internal
> > > to Mahout, the dimensions are assumed to be ordinal. If you do have
> > > reason to say column 1 corresponds to something with an id of 12 (your
> > > example below), then you handle that in the output phase of your
> > > problem. In other words, if you get an answer corresponding to the
> > > Mahout column index of 1, you look up its association to 12 in some
> > > dictionary you keep outside of Mahout; the same goes for the row keys.
> > > Don't put external ids in the matrix unless they really are ordinal
> > > dimensions.
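> > >
> > > A toy sketch of that outside-of-Mahout dictionary (plain Java; the names
> > > are made up for illustration):
> > >
> > >   Map<Integer, Integer> colDict = new HashMap<Integer, Integer>();  // java.util
> > >   colDict.put(1, 12);               // Mahout column 1 <-> external id 12
> > >   int externalId = colDict.get(1);  // translate Mahout output on the way out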
> > >
> > > As Dmitriy said, this sounds like a dense-matrix problem. Usually when
> > > I've used SSVD, it was on a very sparse matrix with 80,000-500,000
> > > columns, so reduction yields big benefits. Also remember that the output
> > > is always a dense matrix, so ops performed on it tend to be more
> > > heavyweight.
> > >
> > >
> > > On Mar 19, 2014, at 11:16 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > >
> > > On Wed, Mar 19, 2014 at 11:00 AM, Vijay B <b.vijay....@gmail.com> wrote:
> > >
> > > > Thanks a lot for the detailed explanation; it was very helpful.
> > > > I will write a CSV-to-sequence-file converter; I just needed some
> > > > clarity on the key/value pairs in the sequence file.
> > > >
> > > > Suppose my CSV file contains the values below:
> > > > 11,22,33,44,55
> > > > 13,23,34,45,56
> > > >
> > > > I assume that the sequence file would look like this, where 12, 1, 14,
> > > > 8, and 15 are indices which hold the values:
> > > > Key: 1: Value: {12:11,1:22,14:33,8:44,15:55}
> > > > Key: 2: Value: {12:13,1:23,14:34,8:45,15:56}
> > > >
> > >
> > > I am not sure -- why are you remapping ordinal positions into index
> > > positions? Obviously, DRM supports sparse computations (i.e. you can use
> > > either SequentialAccessSparseVector or RandomAccessSparseVector as
> > > vector values, as long as they have the same cardinality). However, if
> > > you imply that all data point ordinal positions map into the same sparse
> > > vector index, then there's no true sparsity here, and you could just
> > > form dense vectors in the ordinal order of your data, it seems.
> > >
> > > Other than that, I don't see any issues with your assumptions.
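> > >
> > > For illustration, both vector forms in the Mahout math API (the
> > > cardinality of 5 here matches your example rows; a sketch, nothing more):
> > >
> > >   // org.apache.mahout.math.{Vector, DenseVector, RandomAccessSparseVector}
> > >   Vector dense = new DenseVector(new double[] {11, 22, 33, 44, 55});
> > >   Vector sparse = new RandomAccessSparseVector(5);  // same cardinality
> > >   sparse.set(0, 11.0);  // only the slots you set are stored
> > >   sparse.set(3, 44.0);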
> > >
> > >
> > > > Please confirm if my understanding is correct.
> > > >
> > > > Thanks,
> > > > Vijay
> > > >
> > > >
> > > > On Wed, Mar 19, 2014 at 11:02 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > > >
> > > >> I am not sure if we have direct CSV converters to do that; CSV is not
> > > >> that expressive anyway. But it is not difficult to write up such a
> > > >> converter on your own, I suppose.
> > > >>
> > > >> The steps you need to do are these:
> > > >>
> > > >> (1) Prepare a set of data points in the form of (unique vector key,
> > > >> n-vector) tuples. The vector key can be anything that can be adapted
> > > >> into a WritableComparable, notably Long or String. The vector key
> > > >> also has to be unique to make sense for you.
> > > >> (2) Save the above tuples into a set of sequence files so that the
> > > >> sequence file key is the unique vector key and the sequence file
> > > >> value is o.a.m.math.VectorWritable (a converter sketch follows this
> > > >> list).
> > > >> (3) Decide how many dimensions there will be in the reduced space.
> > > >> The key point is that it's reduced, i.e. you don't need too many.
> > > >> Say 50.
> > > >> (4) Run mahout ssvd --pca true --us true --v false -k <k> .... The
> > > >> reduced-dimensionality output will be in the folder USigma. The
> > > >> output will have the same keys bound to vectors in the reduced space
> > > >> of k dimensions.
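> > > >>
> > > >> A minimal sketch of such a CSV-to-sequence-file converter for steps
> > > >> (1)-(2) (the class name, the IntWritable row keys, and the argument
> > > >> layout are assumptions for illustration, not an existing Mahout tool):
> > > >>
> > > >> import java.io.BufferedReader;
> > > >> import java.io.FileReader;
> > > >> import org.apache.hadoop.conf.Configuration;
> > > >> import org.apache.hadoop.fs.FileSystem;
> > > >> import org.apache.hadoop.fs.Path;
> > > >> import org.apache.hadoop.io.IntWritable;
> > > >> import org.apache.hadoop.io.SequenceFile;
> > > >> import org.apache.mahout.math.DenseVector;
> > > >> import org.apache.mahout.math.VectorWritable;
> > > >>
> > > >> public class Csv2Seq {
> > > >>   // usage: Csv2Seq <local csv in> <seqfile out>
> > > >>   public static void main(String[] args) throws Exception {
> > > >>     Configuration conf = new Configuration();
> > > >>     FileSystem fs = FileSystem.get(conf);
> > > >>     SequenceFile.Writer w = SequenceFile.createWriter(
> > > >>         fs, conf, new Path(args[1]), IntWritable.class, VectorWritable.class);
> > > >>     BufferedReader in = new BufferedReader(new FileReader(args[0]));
> > > >>     try {
> > > >>       String line;
> > > >>       int row = 1;
> > > >>       while ((line = in.readLine()) != null) {
> > > >>         String[] cols = line.split(",");
> > > >>         double[] vals = new double[cols.length];
> > > >>         for (int i = 0; i < cols.length; i++) {
> > > >>           vals[i] = Double.parseDouble(cols[i]);
> > > >>         }
> > > >>         // key = unique row id, value = the row as a dense n-vector
> > > >>         w.append(new IntWritable(row++),
> > > >>                  new VectorWritable(new DenseVector(vals)));
> > > >>       }
> > > >>     } finally {
> > > >>       in.close();
> > > >>       w.close();
> > > >>     }
> > > >>   }
> > > >> }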
> > > >>
> > > >>
> > > >> On Wed, Mar 19, 2014 at 9:45 AM, Vijay B <b.vijay....@gmail.com> wrote:
> > > >>
> > > >>> Hi All,
> > > >>> I have a CSV file on which I have to perform dimensionality
> > > >>> reduction. I'm new to Mahout; from some searching, I understood that
> > > >>> SSVD can be used for performing dimensionality reduction. I'm not
> > > >>> sure of the steps that have to be executed before SSVD; please help
> > > >>> me.
> > > >>>
> > > >>> Thanks,
> > > >>> Vijay
> > > >>>
> > > >>
> > > >
> > >
> > >
> >
>
