On Monday, March 10, 2014 4:21 AM, Kevin Moulart <kevinmoul...@gmail.com> wrote:
 


>> It's not clear to me from your description what exact sequence of steps you are
>> running through, but an SSVD job requires a matrix as input (not a sequencefile
>> of <Text, VectorWritable>).
>>
>> When you try running a seqdumper on your SSVD output, do you see anything?
 
> I see a Sequence File Text/VectorWritable with my original keys, and 99
> values for each element in my original dataset.
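
(For reference, that check is typically just something like

mahout seqdumper -i /user/myCompany/Echant/echant100k_red.seq/U

with the path being a placeholder for wherever SSVD wrote its output.)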

>> The next step after you create your sequencefiles of Vectors would be to run
>> the rowId job to generate a matrix and docIndex.
>>
>> This matrix needs to be the input to SSVD (for dimensional reduction),
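
(For reference, that step is typically just the rowid driver, along these
lines, with placeholder paths:

mahout rowid -i /user/myCompany/Echant/echant100k.seq -o /user/myCompany/Echant/echant100k-rowid

which should leave a matrix and a docIndex file under the output directory.)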

> Ok, so I tried that, and indeed SSVD accepts the matrix as input and gives
> me a Sequence File IntWritable/VectorWritable.
 
>> followed by train Naive Bayes and test Naive Bayes.

> Here it doesn't work anymore: the NB wants a Sequence File
> Text/VectorWritable, and it won't take the one created above.
> Is there a counterpart to rowId that takes the matrix and docIndex and
> outputs that SequenceFile?

Hmm... not that I know of. You are going to have to write a utility that
reads docIndex and the <IntWritable, VectorWritable> sequencefile as inputs:

   a) Create a dictionary of documentId -> documentName from docIndex.
   b) Read each Pair<IntWritable, VectorWritable> from the
      sequencefile<IntWritable, VectorWritable>; for each pair, replace the
      key <IntWritable> with the corresponding documentName <Text> from the
      dictionary in (a), then SequenceFile.Writer.append(Text, VectorWritable).
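
A rough single-process sketch of such a utility (class and variable names here
are made up; it assumes docIndex is the SequenceFile<IntWritable, Text> that
rowId writes, and that the matrix side is a single sequencefile rather than a
directory of part-files, which you would otherwise loop over):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

// Joins the rowId docIndex back onto an <IntWritable, VectorWritable>
// sequencefile so trainnb gets the <Text, VectorWritable> pairs it expects.
public class NamedVectorJoin {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path docIndex = new Path(args[0]);   // docIndex written by the rowId job
    Path matrix = new Path(args[1]);     // e.g. SSVD's U output (one part file)
    Path out = new Path(args[2]);        // becomes the trainnb input

    // (a) dictionary of documentId -> documentName from docIndex
    Map<Integer, String> names = new HashMap<Integer, String>();
    SequenceFile.Reader index = new SequenceFile.Reader(fs, docIndex, conf);
    IntWritable id = new IntWritable();
    Text name = new Text();
    while (index.next(id, name)) {
      names.put(id.get(), name.toString());
    }
    index.close();

    // (b) rewrite each <IntWritable, VectorWritable> as <Text, VectorWritable>
    SequenceFile.Reader rows = new SequenceFile.Reader(fs, matrix, conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    IntWritable row = new IntWritable();
    VectorWritable vec = new VectorWritable();
    Text key = new Text();
    while (rows.next(row, vec)) {
      key.set(names.get(row.get()));     // swap the row id for the document name
      writer.append(key, vec);
    }
    rows.close();
    writer.close();
  }
}

trainnb should then accept the resulting file as its input.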

   Question: I might have missed it, but what's the reason again that you are
running PCA before TrainNaiveBayes?
    
   If others have better ideas, please feel free to comment.




> Kévin Moulart


>2014-03-07 16:23 GMT+01:00 Suneel Marthi <suneel_mar...@yahoo.com>:
>
>It's not clear to me from your description what exact sequence of steps you are
>running through, but an SSVD job requires a matrix as input (not a sequencefile
>of <Text, VectorWritable>).
>
>When you try running a seqdumper on your SSVD output, do you see anything?
>
>The next step after you create your sequencefiles of Vectors would be to run the
>rowId job to generate a matrix and docIndex.
>
>This matrix needs to be the input to SSVD (for dimensional reduction),
>followed by train Naive Bayes and test Naive Bayes.
>
>On Friday, March 7, 2014 10:01 AM, Kevin Moulart <kevinmoul...@gmail.com> 
>wrote:
>
>Hi again,
>
>I'm now using Mahout 0.9, and I'm trying to use PCA (via SSVD) to
>reduce the dimension of a dataset from 1600+ features to ~100, and then to
>use the reduced dataset to train a naive bayes model and test it.
>
>So here is my workflow:
>
>   - Transform my CSV into a SequenceFile, using a custom MapReduce job, with:
>
>key = class as Text (with a "/" in it to be accepted by NaiveBayes, so in
>the form "class/class")
>
>value = features as VectorWritable
>
>(a simplified sketch of this conversion step appears right after this workflow)
>
>   - Use mahout command line to reduce the dimension of the dataset:
>
>
>mahout ssvd -i /user/myCompany/Echant/echant100k.seq -o
>/user/myCompany/Echant/echant100k_red.seq --rank 100 -us -V false -U true
>-pca -ow -t 3
>
>==> Here I get, if I understand things correctly, U, being the reduced
>dataset.
>
>   - Use mahout command line to train the NaiveBayes model:
>
>
>mahout trainnb -i /user/myCompany/Echant/echant100k_red.seq/U -o
>/user/myCompany/Echant/echant100k_red.model -l 0,1
>-li /user/myCompany/Echant/labelIndex100k_red -ow
>
>
>   - Use mahout command line to test the generated model:
>
>
>mahout testnb
>-i /user/myCompany/Echant/echant100k_red.seq/U --model
>/user/myCompany/Echant/echant100k_red.model -ow
>-o /user/myCompany/Echant/predicted_echant100k --labelIndex
>/user/myCompany/Echant/labelIndex100k_red
>
>(Here I test with the same dataset, but I should try with other datasets as
>well once it runs smoothly)
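
A hypothetical, stripped-down version of the CSV-to-SequenceFile step from the
first bullet above, for illustration only (this is not the poster's actual
MapReduce job; the comma delimiter, the label-in-first-column layout, and all
names are assumptions):

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToVectors {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, VectorWritable.class);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] cols = line.split(",");          // assumes comma-separated, no header
      double[] features = new double[cols.length - 1];
      for (int i = 1; i < cols.length; i++) {   // assumes the label is in column 0
        features[i - 1] = Double.parseDouble(cols[i]);
      }
      // key must look like "label/label" so NaiveBayes accepts it
      writer.append(new Text(cols[0] + "/" + cols[0]),
                    new VectorWritable(new DenseVector(features)));
    }
    in.close();
    writer.close();
  }
}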
>
>Here is my problem: everything seems to work quite well until I test my
>model; the output is full of NaN:
>
>
>Key: 1: Value: {0:NaN,1:NaN}
>Key: 1: Value: {0:NaN,1:NaN}
>Key: 0: Value: {0:NaN,1:NaN}
>Key: 0: Value: {0:NaN,1:NaN}
>Key: 1: Value: {0:NaN,1:NaN}
>Key: 0: Value: {0:NaN,1:NaN}
>Key: 1: Value: {0:NaN,1:NaN}
>Key: 0: Value: {0:NaN,1:NaN}
>Key: 0: Value: {0:NaN,1:NaN}
>Key: 0: Value: {0:NaN,1:NaN}
>Key: 1: Value: {0:NaN,1:NaN}
>
>
>I already had the same problem when training and testing with the full
>dataset, but there about 15% of the data still had values in the output and
>got predicted; the rest were NaN and unpredicted.
>
>Could you help me see what could be causing that?
>
>Thanks in advance
>Bests,
>
>Kévin Moulart
