Re: PCA to improve classification performances

Kevin Moulart Mon, 10 Mar 2014 01:22:24 -0700

> It's not clear to me from your description what exact sequence of steps
> you are running through, but an SSVD job requires a matrix as input (not a
> sequence file of <Text, VectorWritable>).
> When you run seqdumper on your SSVD output, do you see anything?
>


I see a Sequence File Text/VectorWritable with my original keys, and 99
values for each element in my original dataset.

> The next step after you create your sequence files of vectors would be to run
> the rowId job to generate a matrix and docIndex.
>
> This matrix needs to be the input to SSVD (for dimensionality reduction),


Ok so I tried that and indeed the SSVD accepts the matrix as input and
gives me a Sequence File IntWritable/VectorWritable.


> followed by train Naive Bayes and test Naive Bayes.


Here it doesn't work anymore: NB wants a Sequence File
Text/VectorWritable, and it won't take the one created above.
Is there a counterpart to rowId that takes a matrix and a docIndex and
outputs the SequenceFile?
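
The operation asked for here amounts to a join on the integer row id: rowId writes a docIndex (IntWritable -> Text) next to the matrix (IntWritable -> VectorWritable), so the original keys can be put back by looking each row id up in the docIndex. A minimal sketch of that logic in plain Python, with in-memory dicts standing in for the two sequence files; `restore_keys` is a hypothetical helper name, not an existing Mahout job:

```python
def restore_keys(doc_index, reduced_rows):
    """Re-attach original text keys to rows keyed by integer row id.

    doc_index    -- dict mapping row id (int) to the original key (str),
                    standing in for the docIndex written by rowId
    reduced_rows -- dict mapping row id (int) to a list of floats,
                    standing in for the IntWritable/VectorWritable output
    Returns a list of (original_key, vector) pairs ordered by row id.
    """
    restored = []
    for row_id in sorted(reduced_rows):
        if row_id not in doc_index:
            # A row without a docIndex entry cannot be labelled again.
            raise KeyError("row id %d missing from docIndex" % row_id)
        restored.append((doc_index[row_id], reduced_rows[row_id]))
    return restored


# Toy data: keys in the "class/class" form described in the workflow.
doc_index = {0: "1/1", 1: "0/0"}
reduced_rows = {1: [0.5, -0.2], 0: [0.1, 0.3]}
pairs = restore_keys(doc_index, reduced_rows)
```

On real data the same lookup would have to be done by a small job that reads both sequence files and writes Text/VectorWritable pairs.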

Kévin Moulart


2014-03-07 16:23 GMT+01:00 Suneel Marthi <suneel_mar...@yahoo.com>:

> It's not clear to me from your description what exact sequence of steps
> you are running through, but an SSVD job requires a matrix as input (not a
> sequence file of <Text, VectorWritable>).
>
> When you run seqdumper on your SSVD output, do you see anything?
>
> The next step after you create your sequence files of vectors would be to run
> the rowId job to generate a matrix and docIndex.
>
> This matrix needs to be the input to SSVD (for dimensionality reduction),
> followed by train Naive Bayes and test Naive Bayes.
>
>
>
>
>
> On Friday, March 7, 2014 10:01 AM, Kevin Moulart <kevinmoul...@gmail.com>
> wrote:
>
> Hi again,
>
> I'm now using Mahout 0.9, and I'm trying to use PCA (via the SSVD) to
> reduce the dimension of a dataset from 1600+ features to ~100 and then to
> use the reduced dataset to train a naive Bayes model and test it.
>
> So here is my workflow :
>
>    - Transform my CSV into a SequenceFile with
>
> key = class as Text (with a "/" in it to be accepted by NaiveBayes, so in
> the form "class/class") using a custom job in MapReduce.
>
> value = features as VectorWritable
>
>    - Use mahout command line to reduce the dimension of the dataset :
>
> mahout ssvd -i /user/myCompany/Echant/echant100k.seq -o
> /user/myCompany/Echant/echant100k_red.seq --rank 100 -us -V false -U true
> -pca -ow -t 3
>
> ==> Here I get - if I understand things correctly - U, being the reduced
> dataset.
>
>    - Use mahout command line to train the NaiveBayes model :
>
> mahout trainnb -i /user/myCompany/Echant/echant100k_red.seq/U -o
> /user/myCompany/Echant/echant100k_red.model -l 0,1
> -li /user/myCompany/Echant/labelIndex100k_red -ow
>
>
>    - Use mahout command line to test the generated model :
>
> mahout testnb
> -i /user/myCompany/Echant/echant100k_red.seq/U --model
> /user/myCompany/Echant/echant100k_red.model -ow
> -o /user/myCompany/Echant/predicted_echant100k --labelIndex
> /user/myCompany/Echant/labelIndex100k_red
>
> (Here I test with the same dataset, but I should try with other datasets as
> well once it runs smoothly)
>
> Here is my problem: everything seems to work quite well until I test my
> model, at which point the output is full of NaN:
>
>
> Key: 1: Value: {0:NaN,1:NaN}
> Key: 1: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 1: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 1: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 0: Value: {0:NaN,1:NaN}
> Key: 1: Value: {0:NaN,1:NaN}
>
>
> I already have the same problem when training and testing with the full
> dataset, but there about 15% of the data still has values in the output and
> gets predicted, the rest being NaN and unpredicted.
>
> Could you help me see what could be causing that ?
>
> Thanks in advance
> Bests,
>
> Kévin Moulart
>
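
One possible cause of the NaN scores, offered as an assumption and not something confirmed in this thread: PCA-reduced features generally contain negative values, while a multinomial-style naive Bayes takes the logarithm of smoothed per-label feature sums. If a summed feature weight is negative enough, the argument of the logarithm goes negative and the result is NaN. The sketch below uses a simplified textbook weight formula, not Mahout's actual code; `java_style_log` and `smoothed_log_weight` are hypothetical names, and the first one only mimics Java's `Math.log`, which returns NaN for negative input where Python's `math.log` raises an error.

```python
import math


def java_style_log(x):
    """Mimic Java's Math.log: NaN for negative or NaN input, -inf for zero."""
    if math.isnan(x) or x < 0:
        return float("nan")
    if x == 0:
        return float("-inf")
    return math.log(x)


def smoothed_log_weight(feature_sum, label_sum, num_features, alpha=1.0):
    """Textbook smoothed log weight for one (label, feature) pair.

    feature_sum  -- sum of this feature's values over the label's rows
    label_sum    -- sum of all feature values over the label's rows
    num_features -- number of features
    alpha        -- smoothing constant (assumed value, for illustration)
    """
    numerator = feature_sum + alpha
    denominator = label_sum + alpha * num_features
    return java_style_log(numerator / denominator)


# Non-negative sums, as with term counts: the weight is a finite number.
ok_weight = smoothed_log_weight(3.0, 10.0, 4)

# A negative feature sum, as PCA output can produce: the weight is NaN.
bad_weight = smoothed_log_weight(-2.5, 10.0, 4)
```

If this is what is happening, dumping a few vectors with seqdumper and checking them for negative entries would confirm or rule it out.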
