> Its not clear to me from ur description as to the exact sequence of steps
> u r running thru, but an SSVD job requires a matrix as input (not a
> sequencefile of <Text, VectorWritables>.
>
> When u try running a seqdumper on ur SSVD output do u see anything?
>
I see a Sequence File Text/VectorWritable with my original keys, and 99
values for each element in my original dataset.

> The next step after u create ur sequencefiles of Vectors would be to run
> the rowId job to generate a matrix and docIndex.
> This matrix needs to be the input to SSVD (for dimensional reduction),

OK, so I tried that, and indeed SSVD accepts the matrix as input and gives
me a Sequence File IntWritable/VectorWritable.

> followed by train Naive Bayes and test Naive Bayes.

This is where it stops working: NB wants a Sequence File
Text/VectorWritable, and it won't take the one created above. Is there a
counterpart to rowId that takes a matrix and a docIndex and outputs the
SequenceFile?

Kévin Moulart

2014-03-07 16:23 GMT+01:00 Suneel Marthi <suneel_mar...@yahoo.com>:

> Its not clear to me from ur description as to the exact sequence of steps
> u r running thru, but an SSVD job requires a matrix as input (not a
> sequencefile of <Text, VectorWritables>.
>
> When u try running a seqdumper on ur SSVD output do u see anything?
>
> The next step after u create ur sequencefiles of Vectors would be to run
> the rowId job to generate a matrix and docIndex.
>
> This matrix needs to be the input to SSVD (for dimensional reduction),
> followed by train Naive Bayes and test Naive Bayes.
>
>
> On Friday, March 7, 2014 10:01 AM, Kevin Moulart <kevinmoul...@gmail.com>
> wrote:
>
> Hi again,
>
> I'm now using Mahout 0.9, and I'm trying to use PCA (via the SSVD) to
> reduce the dimension of a dataset from 1600+ features to ~100 and then to
> use the reduced dataset to train a naive bayes model and test it.
>
> So here is my workflow :
>
> - Transform my CSV into a SequenceFile with
>
> key = class as Text (with a "/" in it to be accepted by NaiveBayes, so in
> the form "class/class") using a custom job in MapReduce.
>
> value = features as VectorWritable
>
> - Use mahout command line to reduce the dimension of the dataset :
>
> mahout ssvd -i /user/myCompny/Echant/echant100k.seq -o
> /user/myCompany/Echant/echant100k_red.seq --rank 100 -us -V false -U true
> -pca -ow -t 3
>
> ==> Here I get - if I understand things correctly - U, being the reduced
> dataset.
>
> - Use mahout command line to train the NaiveBayes model :
>
> mahout trainnb -i /user/myCompany/Echant/echant100k_red.seq/U -o
> /user/myCompany/Echant/echant100k_red.model -l 0,1
> -li /user/myCompany/Echant/labelIndex100k_red -ow
>
>
> - Use mahout command line to test the generated model :
>
> mahout testnb
> -i /user/myCompany/Echant/echant100k_red.seq/U --model
> /user/myCompany/Echant/echant100k_red.model -ow
> -o /user/myCompany/Echant/predicted_echant100k --labelIndex
> /user/myCompany/Echant/labelIndex100k_red
>
> (Here I test with the same dataset, but I should try with other datasets as
> well once it runs smoothly)
>
> Here is my problem: everything seems to work quite well until I test my
> model, then the output is full of NaN :
>
>
> Key: 1: Value: {0:NaN,1:NaN}
> Key: 1: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 1: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 1: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 1: Value: {0:NaN,1:NaN}
>
>
> I already have the same problem when training and testing with the full
> dataset, but there about 15% of the data still has values in the output
> and gets predicted, the rest being NaN and unpredicted.
>
> Could you help me see what could be causing that ?
>
> Thanks in advance
> Bests,
>
> Kévin Moulart
>
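
On the question above about a counterpart to rowId: here is a minimal sketch of the re-keying logic only, assuming that docIndex maps each integer row id to the original Text key and that the SSVD output keeps those same row ids as its keys. It works on plain in-memory pairs in Python to illustrate the join; it does not read or write SequenceFiles and it is not a Mahout API. The function name rekey_rows and the sample keys are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Vector = List[float]


def rekey_rows(
    doc_index: Iterable[Tuple[int, str]],
    reduced_rows: Iterable[Tuple[int, Vector]],
) -> Iterator[Tuple[str, Vector]]:
    """Join reduced rows back to their original text keys by row id.

    doc_index    : (row id, original key) pairs, as assumed for docIndex.
    reduced_rows : (row id, vector) pairs, as assumed for the SSVD output.
    Yields (original key, vector) pairs in the order of reduced_rows.
    """
    # The index is the small side of the join, so hold it in memory.
    id_to_key: Dict[int, str] = dict(doc_index)
    for row_id, vector in reduced_rows:
        if row_id not in id_to_key:
            # A missing id means the two inputs do not come from the same run.
            raise KeyError(f"row id {row_id} has no entry in docIndex")
        yield id_to_key[row_id], vector


if __name__ == "__main__":
    # Hypothetical sample data in the "class/class" key form described above.
    sample_index = [(0, "1/1"), (1, "0/0"), (2, "1/1")]
    sample_rows = [(2, [0.4, -0.1]), (0, [0.2, 0.3]), (1, [-0.5, 0.0])]
    for key, vec in rekey_rows(sample_index, sample_rows):
        print(key, vec)
```

In a real job the same join would have to be done over the two SequenceFiles (for example in a small map-side job that loads docIndex first), writing Text keys and VectorWritable values so that trainnb can read the result.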