Re: PCA to improve classification performances

2014-03-10 Thread Dmitriy Lyubimov
OK, it's just FYI as you build out your pipelines. There's a bit of inconsistency between DRM-based methods in Mahout. Some methods require Int row keys, some don't. Some also rely on the names of a NamedVector, and some don't. PCA/SSVD propagates BOTH keys from the sequence file AND nam

Re: PCA to improve classification performances

2014-03-10 Thread Kevin Moulart
Yes, but rowId transforms my dataset into an index that maps keys like 0, 1, 2... to my actual keys, plus a sequence file indexed by these new integer keys. Then PCA/SSVD comes in and outputs a reduced matrix (as a sequence file using the same keys it found in the input file, which are the
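
The key bookkeeping described in this message can be sketched in plain Python. This is only an illustration under stated assumptions, not Mahout code: `row_id` and `restore_keys` are hypothetical helper names, the dictionaries stand in for sequence files, and the slicing step is merely a placeholder for a reduction that leaves row keys unchanged.

```python
def row_id(rows):
    """Re-key (original_key, vector) pairs with integers 0, 1, 2...

    Returns (index, matrix): index maps int key -> original key,
    matrix maps int key -> vector. Hypothetical helper, not a Mahout API.
    """
    index = {}
    matrix = {}
    for i, (key, vector) in enumerate(rows):
        index[i] = key
        matrix[i] = vector
    return index, matrix


def restore_keys(index, reduced):
    """Join a reduced matrix (int keys) back to the original keys."""
    return {index[i]: vector for i, vector in reduced.items()}


# Toy input: text keys with 3-dimensional vectors (made-up values).
rows = [("user_a", [1.0, 2.0, 3.0]), ("user_b", [4.0, 5.0, 6.0])]

index, matrix = row_id(rows)

# Placeholder for the reduction step: keep the first 2 components.
# The only property that matters here is that the row keys are unchanged.
reduced = {i: vector[:2] for i, vector in matrix.items()}

restored = restore_keys(index, reduced)
```

Because the reduction keeps whatever keys it was given, the integer index is only needed when a step insists on Int row keys; otherwise the original keys can be carried through directly.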

Re: PCA to improve classification performances

2014-03-10 Thread Dmitriy Lyubimov
PCA and SSVD propagate the exact row keys given in the input. If you give them text keys, U and Usigma will have text keys. It doesn't change that. On Mar 10, 2014 3:39 AM, "Kevin Moulart" wrote: > Hi and thanks, I'll try that, but I'd like to do so using a mapreduce job > to improve performances. > >

Re: PCA to improve classification performances

2014-03-10 Thread Kevin Moulart
Hi and thanks, I'll try that, but I'd like to do so using a mapreduce job to improve performance. I'm using PCA as a way to reduce the dimension of the dataset both to improve its relevance (with 1600+ variables, many of them are correlated) and to improve the performance of the classification a

Re: PCA to improve classification performances

2014-03-10 Thread Suneel Marthi
On Monday, March 10, 2014 4:21 AM, Kevin Moulart wrote: It's not clear to me from ur description as to the exact sequence of steps u r running thru, but an SSVD job requires a matrix as input (not a sequencefile of . >When u try running a seqdumper on ur SSVD output do u see anything? >

Re: PCA to improve classification performances

2014-03-10 Thread Kevin Moulart
> It's not clear to me from ur description as to the exact sequence of steps > u r running thru, but an SSVD job requires a matrix as input (not a > sequencefile of . > When u try running a seqdumper on ur SSVD output do u see anything? > I see a SequenceFile of Text/VectorWritable with my original ke

Re: PCA to improve classification performances

2014-03-07 Thread Suneel Marthi
It's not clear to me from ur description as to the exact sequence of steps u r running thru, but an SSVD job requires a matrix as input (not a sequencefile of . When u try running a seqdumper on ur SSVD output do u see anything? The next step after u create ur sequencefiles of Vectors would be

PCA to improve classification performances

2014-03-07 Thread Kevin Moulart
Hi again, I'm now using Mahout 0.9, and I'm trying to use PCA (via the SSVD) to reduce the dimension of a dataset from 1600+ features to ~100 and then to use the reduced dataset to train a Naive Bayes model and test it. So here is my workflow: - Transform my CSV into a SequenceFile with key