Re: Use Naïve Bayes on a large CSV

Ted Dunning Mon, 24 Feb 2014 06:39:06 -0800

Kevin,

While this is fresh in your mind, can you prepare a Javadoc patch that would
have helped you out? And suggest other doc patches as well?




On Mon, Feb 24, 2014 at 3:00 AM, Kevin Moulart <kevinmoul...@gmail.com> wrote:

> Thanks, that's about the clearest answer I got so far :)
>
>
> 2014-02-24 11:59 GMT+01:00 Sebastian Schelter <s...@apache.org>:
>
> > NaiveBayes expects a SequenceFile as input. The key is the class label as
> > Text, the value is the feature vector as a VectorWritable.
> >
> > --sebastian
> >
> >
> > On 02/24/2014 11:51 AM, Kevin Moulart wrote:
> >
> >> Hi again,
> >> I finally decided to go through Java to build a sequence file for Naive
> >> Bayes, but I still can't find anywhere that states exactly what the
> >> sequence file should contain for Mahout to process it with Naive Bayes.
> >>
> >> I tried virtually every piece of code I found related to this subject,
> >> with no luck.
> >>
> >> My CSV file looks like this:
> >> Label that I want to predict, feature 1, feature 2, ..., feature 1628
> >>
> >> Could someone tell me exactly what the Naive Bayes training procedure
> >> expects?
> >>
> >>
> >> 2014-02-20 13:56 GMT+01:00 Jay Vyas <jayunit...@gmail.com>:
> >>
> >>> This relates to a previous question I have: does Mahout have a concept
> >>> of adapters that let us read CSV-style data, with filters, to create
> >>> the exact format for its various inputs (e.g. the recommender's
> >>> three-column format)? If not, is it worth a JIRA?
> >>>
> >>>
> >>> On Feb 20, 2014, at 7:50 AM, Kevin Moulart <kevinmoul...@gmail.com> wrote:
> >>>
> >>>>
> >>>> Hi, and thanks!
> >>>>
> >>>> What about the command line? Is there a way to do that using the
> >>>> existing command line?
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> 2014-02-20 12:02 GMT+01:00 Suneel Marthi <suneel_mar...@yahoo.com>:
> >>>>
> >>>>> To convert the input CSV to vectors, you can either:
> >>>>>
> >>>>> a) use CSVIterator
> >>>>> b) use InputDriver
> >>>>>
> >>>>> Either of the above should generate vectors from the input CSV that can
> >>>>> then be fed into Mahout classifier/clustering jobs.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <kevinmoul...@gmail.com> wrote:
> >>>>>
> >>>>> Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file
> >>>>> from the command line.
> >>>>>
> >>>>> I know I have to feed the classifier a sequence file, so I tried to put
> >>>>> my CSV into one using the seqdirectory command, but even with a really
> >>>>> small CSV (less than 100 MB) I instantly get an OutOfMemoryError (Java
> >>>>> heap space):
> >>>>>
> >>>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
> >>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >>>>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
> >>>>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
> >>>>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
> >>>>>> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
> >>>>>> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
> >>>>>> --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
> >>>>>> --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
> >>>>>> --tempDir=[temp]}
> >>>>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
> >>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >>>>>> at java.util.Arrays.copyOf(Arrays.java:2367)
> >>>>>> at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> >>>>>> at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> >>>>>> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> >>>>>> at java.lang.StringBuilder.append(StringBuilder.java:132)
> >>>>>> at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
> >>>>>> at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
> >>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
> >>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
> >>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
> >>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> >>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
> >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
> >>>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> >>>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> >>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
> >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
> >>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> >>>>>>
> >>>>>
> >>>>>
> >>>>> Do you have an idea, or a simple way to use Naive Bayes on my large
> >>>>> CSV?
> >>>>>
> >>>>> Thanks in advance !
> >>>>> --
> >>>>> Kévin Moulart
> >>>>> GSM France : +33 7 81 06 10 10
> >>>>> GSM Belgique : +32 473 85 23 85
> >>>>> Téléphone fixe : +32 2 771 88 45
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
>
>
>
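
The record layout Sebastian describes above (key = class label, value =
feature vector) can be sketched in plain Java. This is only a minimal sketch
under stated assumptions, not Mahout's API: the class CsvRowParser, its
LabeledRow holder and the forEachRow helper are hypothetical names, and the
input is assumed to be a comma-separated row with the label in the first
column and numeric features after it, as in Kevin's CSV. In a real job the
label would presumably be wrapped in a Hadoop Text and the features in a
Mahout VectorWritable before being appended to a SequenceFile; those calls are
deliberately left out here. Rows are handed to a callback one at a time, so
the file is never accumulated in a single in-memory buffer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

public class CsvRowParser {

    /** Hypothetical holder for one row: label (future key) and features (future value). */
    public static final class LabeledRow {
        public final String label;
        public final double[] features;

        LabeledRow(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /** Parses "label, f1, f2, ..." (assumed layout: label first, numeric features after). */
    public static LabeledRow parseLine(String line) {
        String[] cols = line.split(",");
        if (cols.length < 2) {
            throw new IllegalArgumentException(
                "Expected a label and at least one feature: " + line);
        }
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new LabeledRow(cols[0].trim(), features);
    }

    /** Reads one line at a time and passes each parsed row to the sink; blank lines are skipped. */
    public static void forEachRow(BufferedReader reader, Consumer<LabeledRow> sink) {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    sink.accept(parseLine(line));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sink passed to forEachRow is where a writer for the (label, vector) pairs
would go; keeping it a callback means the parsing code does not depend on any
particular Hadoop or Mahout version.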
