Yes, I agree that the dimensionality of my dataset is low. I only intended to
experiment with this data and then apply SSVD to a much larger dataset.

I was actually interested in finding out how my original variables
contribute to every principal component, and your reply answered that.
Many thanks!


On Sat, Mar 22, 2014 at 3:00 AM, Dmitriy Lyubimov <> wrote:

> Vijay,
> what Ted said. It doesn't make much sense to reduce from 12 dimensions to 7,
> because a dimensionality of 12 is already low enough.
> But suppose we accept the rationale for reducing 12 dimensions to 7. Your
> original points are rotated into a PCA space of 7 dimensions, where they
> retain as much as possible of the variance of the original data, i.e. they
> basically retain the proportions of Euclidean distances between each other,
> so they are still suitable for clustering, regression, or whatever else you
> want to do with them on that basis.
> Your U*Sigma output should have the same keys as the input.
> If you want to analyze the contribution of your original variables to
> every principal component, you need to examine the V output, which in your
> case will be really tiny, 12 x 7.
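
The relationship between the U*Sigma and V outputs described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses an exact SVD from NumPy rather than Mahout's stochastic SSVD, the data are randomly generated (not the poster's dataset), and the helper name pca_via_svd is made up for this example.

```python
import numpy as np

def pca_via_svd(rows, k):
    """Return (U*Sigma, V) of the mean-centered data, truncated to k components."""
    a = np.asarray(rows, dtype=float)
    centered = a - a.mean(axis=0)          # PCA operates on mean-centered columns
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    u_sigma = u[:, :k] * s[:k]             # one k-dimensional row per input row, same order
    v = vt[:k].T                           # shape (n_columns, k): loadings per variable
    return u_sigma, v

# Made-up data: 20 points with 12 variables each (an assumption for illustration).
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 12))

u_sigma, v = pca_via_svd(data, k=7)
print(u_sigma.shape)   # (20, 7): reduced points, row i corresponds to input row i
print(v.shape)         # (12, 7): column j holds the weights of the 12 variables in component j

# Indices of the three original variables with the largest weight in component 0.
print(np.argsort(-np.abs(v[:, 0]))[:3])
```

Each column of v gives the weights of the original variables in one principal component, and multiplying the centered data by v reproduces u_sigma.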
> On Fri, Mar 21, 2014 at 11:12 AM, Vijay B <> wrote:
> > Thanks a lot for the reply.
> >
> > To gain an understanding of how SSVD works, I have taken a sample CSV file
> > with 12 columns and I want to perform dimensionality reduction on it by
> > asking SSVD to give me the 7 most significant columns.
> >
> > Snippet of my input csv
> >
> > 22,2,44,36,5,9,2824,2,4,733,285,169
> > 25,1,150,175,3,9,4037,2,18,1822,254,171
> >
> > Here's what I have done.
> > Step 1: Converted the CSV to a sequence file; below is a snippet of the
> > output:
> > Key: 1: Value: 1:{0:22.0,1:2.0,2:44.0,3:36.0,4:5.0,5:9.0,6:2824.0,7:2.0,8:4.0,9:733.0,10:285.0,11:169.0}
> > Key: 2: Value: 2:{0:25.0,1:1.0,2:150.0,3:175.0,4:3.0,5:9.0,6:4037.0,7:2.0,8:18.0,9:1822.0,10:254.0,11:171.0}
> >
> > Step 2: Passed this sequence file as input to the SSVD command; below is
> > the command I used:
> >
> > bin/mahout ssvd -i /user/cloudera/seq-data.seq \
> >   -o /user/cloudera/reduced_dimensions1 \
> >   --rank 7 -us true -V false -U false -pca true -ow -t 1
> >
> > I then executed vectordump on the contents of the USigma folder; below is a
> > snippet of the output:
> >
> > {0:191.5917217160858,1:-349.96930149831184,2:-78.21082086351002,3:98.73075808083476,4:-122.89919847376068,5:4.160343860343885,6:1.4336136023933244}
> > {0:1293.9486625354516,1:697.7408635015182,2:24.0653800270275,3:60.79480738654566,4:11.733624175113523,5:6.479815864873287,6:-0.9269136621845396}
> >
> > Please help me interpret the above results in the USigma folder.
> >
> > Thanks,
> > Vijay.
> >
> > On Fri, Mar 21, 2014 at 9:52 PM, Pat Ferrel <> wrote:
> >
> > > Vijay, how many columns do you have in the CSV? That is the number you
> > > will be reducing.
> > >
> > > csv:
> > > 1,22,33,44,55
> > > 13,23,34,45,56
> > >
> > > would be dense vectors:
> > > Key:1: Value:{1:1,2:22,3:33,4:44,5:55}
> > > Key: 2: Value:{1:13,2:23,3:34,4:45,5:56}
> > >
> > > Unless you have some reason to assign different dimension indexes, the
> > > row and column numbers from your CSV should be used in Mahout. Internal
> > > to Mahout, the dimensions are assumed to be ordinal. If you do have
> > > reasons to say column 1 corresponds to something with an ID of 12 (your
> > > example below), then you handle that in the output phase of your problem.
> > > In other words, if you get an answer corresponding to the Mahout column
> > > index of 1, you look up its association to 12 in some dictionary you keep
> > > outside of Mahout; same with the row keys. Don't put external IDs in the
> > > matrix unless they really are ordinal dimensions.
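
The "dictionary you keep outside of Mahout" idea above could look roughly like the sketch below. It is an illustration only: the external IDs 12, 1, 14, 8, 15 come from the example further down the thread, 0-based column indexes are assumed, and the names column_ids, index_to_id and relabel are made up here.

```python
# External IDs from the example in the thread, listed in CSV column order.
column_ids = [12, 1, 14, 8, 15]

# Ordinal (0-based) column index used inside the matrix -> external ID.
# This mapping lives outside Mahout and is applied only to the output.
index_to_id = dict(enumerate(column_ids))

def relabel(values_by_index):
    """Translate a {column index: value} result into {external ID: value}."""
    return {index_to_id[i]: value for i, value in values_by_index.items()}

# The matrix row itself stays dense and ordinal: positions 0..4.
row = [11, 22, 33, 44, 55]
dense = dict(enumerate(row))

print(relabel(dense))  # {12: 11, 1: 22, 14: 33, 8: 44, 15: 55}
```

The same approach applies to row keys: keep a separate mapping from ordinal row number to external key and apply it after the computation.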
> > >
> > > As Dmitriy said, this sounds like a dense matrix problem. Usually when
> > > I've used SSVD it was on a very sparse matrix with 80,000-500,000
> > > columns, so reduction yields big benefits. Also remember that the output
> > > is always a dense matrix, so ops performed on it tend to be more
> > > heavyweight.
> > >
> > >
> > > On Mar 19, 2014, at 11:16 AM, Dmitriy Lyubimov <> wrote:
> > >
> > > On Wed, Mar 19, 2014 at 11:00 AM, Vijay B <> wrote:
> > >
> > > > Thanks a lot for the detailed explanation, it was very helpful.
> > > > I will write a CSV-to-sequence-file converter; I just needed some
> > > > clarity on the key/value pairs in the sequence file.
> > > >
> > > > Suppose my csv file contains the below values
> > > > 11,22,33,44,55
> > > > 13,23,34,45,56
> > > >
> > > > I assume that the sequence file would look like this, where 12, 1, 14,
> > > > 8, 15 are indices which hold the values
> > > > Key:1: Value:{12:11,1:22,14:33,8:44,15:55}
> > > > Key: 2: Value:{12:13,1:23,14:34,8:45,15:56}
> > > >
> > >
> > > I am not sure -- why are you remapping ordinal position into an index
> > > position? Obviously, DRM supports sparse computations (i.e. you can use
> > > either SequentialAccessSparseVector or RandomAccessSparseVector as vector
> > > values, as long as they have the same cardinality). However, if you imply
> > > that all data point ordinal positions map into the same sparse vector
> > > index, then there's no true sparsity here and you could just form dense
> > > vectors in ordinal order of your data, it seems.
> > >
> > > Other than that, I don't see any issues with your assumptions.
> > >
> > >
> > > > Please confirm if my understanding is correct.
> > > >
> > > > Thanks,
> > > > Vijay
> > > >
> > > >
> > > > On Wed, Mar 19, 2014 at 11:02 PM, Dmitriy Lyubimov <> wrote:
> > > >
> > > >> I am not sure if we have direct CSV converters to do that; CSV is not
> > > >> that expressive anyway. But it is not difficult to write such a
> > > >> converter on your own, I suppose.
> > > >>
> > > >> The steps you need to do are these:
> > > >>
> > > >> (1) Prepare a set of data points in the form of (unique vector key,
> > > >> n-vector) tuples. The vector key can be anything that can be adapted
> > > >> into a WritableComparable, notably Long or String. The vector key also
> > > >> has to be unique to make sense for you.
> > > >> (2) Save the above tuples into a set of sequence files so that the
> > > >> sequence file key is the unique vector key, and the sequence file value
> > > >> is o.a.m.math.VectorWritable.
> > > >> (3) Decide how many dimensions there will be in the reduced space. The
> > > >> key word is "reduced", i.e. you don't need too many. Say 50.
> > > >> (4) Run mahout ssvd --pca true --us true --v false -k <k> .... . The
> > > >> reduced dimensionality output will be in the folder USigma. The output
> > > >> will have the same keys bound to vectors in the reduced space of k
> > > >> dimensions.
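
Step (1) above, turning CSV rows into (unique key, n-vector) tuples, can be sketched as follows. This is a hedged illustration: it assumes 1-based row numbers as keys and dense vectors in column order, the function name csv_to_keyed_vectors is made up, and it deliberately stops short of step (2), since writing a sequence file of VectorWritable values requires the Hadoop and Mahout Java APIs.

```python
import csv
import io

def csv_to_keyed_vectors(text):
    """Yield (key, vector) tuples: key is the 1-based row number,
    vector is a dense list of floats in CSV column order."""
    reader = csv.reader(io.StringIO(text))
    non_empty = (row for row in reader if row)      # skip blank lines
    for key, row in enumerate(non_empty, start=1):
        yield key, [float(x) for x in row]

# The two sample rows quoted elsewhere in this thread.
sample = (
    "22,2,44,36,5,9,2824,2,4,733,285,169\n"
    "25,1,150,175,3,9,4037,2,18,1822,254,171\n"
)

keyed = list(csv_to_keyed_vectors(sample))
for key, vec in keyed:
    # Show each vector as {position: value}, with positions 0..n-1.
    print(key, dict(enumerate(vec)))
```

Because every row has a value in every column, dense vectors in ordinal order are sufficient; no remapping of column positions is needed.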
> > > >>
> > > >>
> > > >> On Wed, Mar 19, 2014 at 9:45 AM, Vijay B <> wrote:
> > > >>
> > > >>> Hi All,
> > > >>> I have a CSV file on which I have to perform dimensionality reduction.
> > > >>> I'm new to Mahout; after some searching I understood that SSVD can be
> > > >>> used for performing dimensionality reduction. I'm not sure of the steps
> > > >>> that have to be executed before SSVD, please help me.
> > > >>>
> > > >>> Thanks,
> > > >>> Vijay
> > > >>>
> > > >>
> > > >
> > >
> > >
> >
