Re: Use Naïve Bayes on a large CSV

Kevin Moulart Tue, 25 Feb 2014 07:56:33 -0800

For reference, this is the program that creates the sequence file:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// The class name is arbitrary; only main() comes from the original post.
public class CsvToSequenceFile {

    // Arguments as used below: args[1] = local CSV file, args[2] = output
    // sequence file path, args[3] = column separator (a regex for split()).
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration(true);
        FileSystem fs = FileSystem.get(conf);

        // The input file is not in HDFS
        BufferedReader reader = new BufferedReader(new FileReader(args[1]));
        Path filePath = new Path(args[2]);
        // Delete the previous output file if it exists
        if (fs.exists(filePath)) {
            fs.delete(filePath, true);
        }
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
                filePath, Text.class, VectorWritable.class);
        // Run through the input file
        String line;
        while ((line = reader.readLine()) != null) {
            // Lines that fail to parse, such as a header row, are skipped
            try {
                // Split with the given separator
                String[] c = line.split(args[3]);
                if (c.length > 1) {
                    // Get the feature set; d[0] stays 0.0 because column 0
                    // holds the label
                    double[] d = new double[c.length];
                    for (int i = 1; i < c.length; i++) {
                        d[i] = Double.parseDouble(c[i]);
                    }
                    // Put it in a vector
                    Vector vec = new RandomAccessSparseVector(c.length);
                    vec.assign(d);
                    VectorWritable writable = new VectorWritable();
                    writable.set(vec);

                    // Create a key with a / and the class label
                    String label = c[0] + "/" + c[0];

                    // Write the pair to the sequence file
                    writer.append(new Text(label), writable);
                }
            } catch (NumberFormatException e) {
                continue;
            }
        }
        writer.close();
        reader.close();
    }
}
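
For anyone who wants to check the row handling without a Hadoop or Mahout classpath, here is a minimal sketch of the same per-line step in plain Java. The CsvRow class and its parse method are hypothetical helpers for illustration, not part of Mahout. Unlike the program above, this sketch stores the first feature at index 0 instead of leaving index 0 unused; that is a deliberate simplification, not a correction of the original.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: it only shows how one CSV row
// maps to a (key, features) pair like the one written to the sequence file.
class CsvRow {
    final String key;        // label formatted as "label/label"
    final double[] features; // one entry per feature column

    CsvRow(String key, double[] features) {
        this.key = key;
        this.features = features;
    }

    // Returns null for rows that cannot be parsed (for example a header line)
    // or that have no feature columns.
    static CsvRow parse(String line, String separator) {
        String[] c = line.split(separator);
        if (c.length < 2) {
            return null;
        }
        double[] d = new double[c.length - 1];
        try {
            for (int i = 1; i < c.length; i++) {
                d[i - 1] = Double.parseDouble(c[i].trim());
            }
        } catch (NumberFormatException e) {
            return null;
        }
        String label = c[0].trim();
        return new CsvRow(label + "/" + label, d);
    }

    public static void main(String[] args) {
        CsvRow row = CsvRow.parse("1, 0, 1.42, 1.12", ",");
        // prints: 1/1 -> [0.0, 1.42, 1.12]
        System.out.println(row.key + " -> " + Arrays.toString(row.features));
        // A header row is rejected, so this prints: null
        System.out.println(CsvRow.parse("label,f1,f2", ","));
    }
}
```

Returning null for unparseable rows mirrors the skip-on-NumberFormatException behaviour of the program above.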




2014-02-25 16:25 GMT+01:00 Kevin Moulart <kevinmoul...@gmail.com>:

> I finally managed to make it run. I had to format the class label in the
> input file with a / in the name, so I put Yes/1 or No/0 instead of just 1 or
> 0.
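
One possible explanation for why the "/" matters, sketched under an assumption: if the trainer derives the class label by splitting the key on "/" and taking the second token, then a bare key such as 0 has no second token and an index error follows. That splitting rule is an assumption about Mahout's internals and should be checked against the source of the version in use. The labelFromKey method below is a hypothetical stand-in written in plain Java, not Mahout code.

```java
// Hypothetical stand-in for the label extraction, NOT Mahout code.
// Assumption: the label is the second "/"-separated token of the key.
class LabelKeyDemo {
    static String labelFromKey(String key) {
        return key.split("/")[1];
    }

    public static void main(String[] args) {
        // Keys that contain a "/" have a second token to use as the label.
        System.out.println(labelFromKey("Yes/1"));
        System.out.println(labelFromKey("0/0"));
        try {
            // A bare key splits into a single token, so index 1 does not exist.
            labelFromKey("0");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("no second token: " + e.getMessage());
        }
    }
}
```

Under that assumption, Yes/1 and No/0 would be trained as classes 1 and 0.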
>
> But then I noticed when testing the model that it doesn't classify all the
> data:
> 14/02/25 16:16:30 INFO mapred.JobClient:   Map-Reduce Framework
> 14/02/25 16:16:30 INFO mapred.JobClient:     Map input records=*300000*
> 14/02/25 16:16:30 INFO mapred.JobClient:     Map output records=300000
> 14/02/25 16:16:30 INFO mapred.JobClient:     Input split bytes=476
> 14/02/25 16:16:30 INFO mapred.JobClient:     Spilled Records=0
> 14/02/25 16:16:30 INFO mapred.JobClient:     CPU time spent (ms)=32000
> 14/02/25 16:16:30 INFO mapred.JobClient:     Physical memory (bytes) snapshot=834502656
> 14/02/25 16:16:30 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3738030080
> 14/02/25 16:16:30 INFO mapred.JobClient:     Total committed heap usage (bytes)=918552576
> 14/02/25 16:16:31 INFO test.TestNaiveBayesDriver: Standard NB Results:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :      36078   91.3552%
> Incorrectly Classified Instances        :       3414    8.6448%
> Total Classified Instances              :      *39492*
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a       b       <--Classified as
> 34445   2114    |  36559   a     = 0
> 1300    1633    |  2933    b     = 1
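
As a quick arithmetic check, the summary figures can be recomputed from the confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes, as the "<--Classified as" header suggests; the class and method names are hypothetical. It only shows that the summary is internally consistent. It does not explain why 39492 instances were counted out of 300000 input records.

```java
import java.util.Locale;

// Recomputes the summary figures from the confusion matrix quoted above.
class ConfusionMatrixCheck {
    // Assumed layout: rows are actual classes, columns are predicted classes.
    static final long[][] MATRIX = {
        {34445, 2114}, // actual class 0
        {1300, 1633},  // actual class 1
    };

    // Sum of every cell: the number of classified instances.
    static long total(long[][] m) {
        long sum = 0;
        for (long[] row : m) {
            for (long cell : row) {
                sum += cell;
            }
        }
        return sum;
    }

    // Sum of the diagonal: instances whose predicted class equals the actual one.
    static long correct(long[][] m) {
        long sum = 0;
        for (int i = 0; i < m.length; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = total(MATRIX);
        long correct = correct(MATRIX);
        // prints: classified=39492 correct=36078 accuracy=91.3552%
        System.out.println(String.format(Locale.ROOT,
                "classified=%d correct=%d accuracy=%.4f%%",
                total, correct, 100.0 * correct / total));
    }
}
```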
>
>
> I ran testnb on the exact same file I used to train the model.
>
> Any ideas?
>
>
> 2014-02-25 11:33 GMT+01:00 Kevin Moulart <kevinmoul...@gmail.com>:
>
>> All right, I've managed to narrow it down to the LabelIndex. I looked at
>> the code, but it isn't really clear to me. What exactly should I
>> provide as the LabelIndex?
>>
>> As a reminder, the lines of my original file look like:
>> 0, 0.3222, 0, 1.543, ...
>> 1, 0, 1.42, 1.12, ...
>>
>> With the 0 and 1 being the labels I'm trying to learn and the rest being the
>> data.
>>
>> For now I have the previously mentioned Java code that creates the
>> SequenceFile from my CSV, but when I then run trainnb on it, it
>> tries to create a LabelIndex and fails with an
>> ArrayIndexOutOfBoundsException: 1.
>>
>> Could someone tell me how to create the index, even manually at this
>> point?
>>
>> Thanks in advance!
>>
>>
>> 2014-02-24 15:41 GMT+01:00 Kevin Moulart <kevinmoul...@gmail.com>:
>>
>>> I'll do that as soon as I manage to make it work ^^'. That's a great
>>> idea!
>>>
>>> I'm stuck with this for now:
>>>
>>> public static void main(String[] args) throws IOException,
>>>         InterruptedException, ClassNotFoundException {
>>>     Configuration conf = new Configuration(true);
>>>     FileSystem fs = FileSystem.get(conf);
>>>     BufferedReader reader = new BufferedReader(new FileReader(args[1]));
>>>     Path filePath = new Path(args[2]);
>>>     if (fs.exists(filePath))
>>>         fs.delete(filePath, true);
>>>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>>>             filePath, Text.class, VectorWritable.class);
>>>     try {
>>>         String line;
>>>         while ((line = reader.readLine()) != null) {
>>>             String[] c = line.split(args[3]);
>>>             if (c.length > 1) {
>>>                 double[] d = new double[c.length];
>>>                 for (int i = 1; i < c.length; i++)
>>>                     d[i] = Double.parseDouble(c[i]);
>>>                 Vector vec = new RandomAccessSparseVector(c.length);
>>>                 vec.assign(d);
>>>                 VectorWritable writable = new VectorWritable();
>>>                 writable.set(vec);
>>>                 writer.append(new Text(c[0]), writable);
>>>             }
>>>         }
>>>         writer.close();
>>>     } catch (Throwable t) {
>>>         t.printStackTrace();
>>>     }
>>>     reader.close();
>>> }
>>>
>>>
>>> Which produces a sequence file but Mahout's trainnb doesn't seem to like
>>> it that much, so I'm working on it for the moment.
>>>
>>>
>>> 2014-02-24 15:37 GMT+01:00 Ted Dunning <ted.dunn...@gmail.com>:
>>>
>>>> Kevin,
>>>>
>>>> While this is fresh in your mind, can you prepare a javadoc patch that
>>>> would have helped you out? And suggest other doc patches as well?
>>>>
>>>>
>>>>
>>>> On Mon, Feb 24, 2014 at 3:00 AM, Kevin Moulart <kevinmoul...@gmail.com> wrote:
>>>>
>>>> > Thanks, that's about the clearest answer I got so far :)
>>>> >
>>>> >
>>>> > 2014-02-24 11:59 GMT+01:00 Sebastian Schelter <s...@apache.org>:
>>>> >
>>>> > > NaiveBayes expects a SequenceFile as input. The key is the class label
>>>> > > as Text, and the value is the feature vector as a VectorWritable.
>>>> > >
>>>> > > --sebastian
>>>> > >
>>>> > >
>>>> > > On 02/24/2014 11:51 AM, Kevin Moulart wrote:
>>>> > >
>>>> > >> Hi again,
>>>> > >> I finally decided to go through Java to make a sequence file for
>>>> > >> Naive Bayes, but I still can't find any place stating exactly what
>>>> > >> should be in the sequence file for Mahout to process it with Naive
>>>> > >> Bayes.
>>>> > >>
>>>> > >> I tried virtually every piece of code I found related to this subject,
>>>> > >> with no luck.
>>>> > >>
>>>> > >> My CSV file is like this:
>>>> > >> Label that I want to predict, feature 1, feature 2, ..., feature 1628
>>>> > >>
>>>> > >> Could someone tell me exactly what the Naive Bayes training procedure
>>>> > >> expects?
>>>> > >>
>>>> > >>
>>>> > >> 2014-02-20 13:56 GMT+01:00 Jay Vyas <jayunit...@gmail.com>:
>>>> > >>
>>>> > >>  This relates to a previous question I have:  Does mahout have a
>>>> concept
>>>> > >>> of
>>>> > >>> adapters which allow us to read data csv style data with filters
>>>> to
>>>> > >>> create
>>>> > >>> exact format  for its various inputs (i.e. Recommender three
>>>> column
>>>> > >>> format).?  If not is it worth a jira?
>>>> > >>>
>>>> > >>>
>>>> > >>>> On Feb 20, 2014, at 7:50 AM, Kevin Moulart <kevinmoul...@gmail.com> wrote:
>>>> > >>>
>>>> > >>>>
>>>> > >>>> Hi and thanks!
>>>> > >>>>
>>>> > >>>> What about the command line? Is there a way to do that using the
>>>> > >>>> existing command line?
>>>> > >>>>
>>>> > >>>>
>>>> > >>>>
>>>> > >>>>
>>>> > >>>> 2014-02-20 12:02 GMT+01:00 Suneel Marthi <suneel_mar...@yahoo.com>:
>>>> > >>>>
>>>> > >>>>> To convert input CSV to vectors, you can either:
>>>> > >>>>>
>>>> > >>>>> a) Use CSVIterator
>>>> > >>>>> b) Use InputDriver
>>>> > >>>>>
>>>> > >>>>> Either of the above should generate vectors from input CSV that could
>>>> > >>>>> then be fed into Mahout classifier/clustering jobs.
>>>> > >>>>>
>>>> > >>>>>
>>>> > >>>>>
>>>> > >>>>>
>>>> > >>>>>
>>>> > >>>>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <
>>>> > >>>>> kevinmoul...@gmail.com> wrote:
>>>> > >>>>>
>>>> > >>>>> Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file
>>>> > >>>>> from the command line.
>>>> > >>>>>
>>>> > >>>>> I know I have to feed the classifier with a seq file, so I tried to put
>>>> > >>>>> my CSV into one using the seqdirectory command, but even when I try with
>>>> > >>>>> a really small CSV (less than 100 MB) I instantly get an
>>>> > >>>>> OutOfMemoryError from Java heap space:
>>>> > >>>>>
>>>> > >>>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
>>>> > >>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>> > >>>>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
>>>> > >>>>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
>>>> > >>>>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/cacf/Echant/testSeq], --keyPrefix=[], --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
>>>> > >>>>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
>>>> > >>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>> > >>>>>> at java.util.Arrays.copyOf(Arrays.java:2367)
>>>> > >>>>>> at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
>>>> > >>>>>> at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
>>>> > >>>>>> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
>>>> > >>>>>> at java.lang.StringBuilder.append(StringBuilder.java:132)
>>>> > >>>>>> at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
>>>> > >>>>>> at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
>>>> > >>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
>>>> > >>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
>>>> > >>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
>>>> > >>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>> > >>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>> > >>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
>>>> > >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> > >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> > >>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> > >>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> > >>>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>>>> > >>>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>>>> > >>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
>>>> > >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> > >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> > >>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> > >>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> > >>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>> > >>>>>>
>>>> > >>>>>
>>>> > >>>>>
>>>> > >>>>> Do you have an idea or a simple way to use Naive Bayes against my
>>>> > >>>>> large CSV?
>>>> > >>>>>
>>>> > >>>>> Thanks in advance!
>>>> > >>>>> --
>>>> > >>>>> Kévin Moulart
>>>> > >>>>> GSM France : +33 7 81 06 10 10
>>>> > >>>>> GSM Belgique : +32 473 85 23 85
>>>> > >>>>> Téléphone fixe : +32 2 771 88 45



-- 
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45
