Re: Mahout classification issue

2013-01-03 Thread Robin Anil
The training format is not one per line (it used to be in a previous implementation). It's one per file. Take a look at the 20news example. Robin Anil | Software Engineer | robin.a...@gmail.com | Google Inc. On Thu, Jan 3, 2013 at 3:11 AM, work_silicon wrote: > Hello there, > > I u

Re: Memory Requirements of Naïve Bayes?

2013-01-03 Thread Robin Anil
the input vectors like those with < 5 frequency. Robin Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Thu, Jan 3, 2013 at 2:23 PM, Adam Baron wrote: > I'm trying to run Naïve Bayes on 2.4GB of tfidf-vectors representing a > bunch of 1-, 2-, 3-grams. However,

Re: How to segment seq2sparse output into predefined training set and test set?

2013-01-04 Thread Robin Anil
Or use seq2encoded; it does randomized hashing instead of tfidf. The performance, as far as I have seen, is identical to seq2sparse, and the model size is much lower (if you give it a lower dimension to project on). Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Fri, Jan 4, 2013 at 7
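
The idea of hashing tokens into a fixed, lower dimension can be sketched as follows. This is only an illustration of the general hashing trick, not Mahout's seq2encoded encoder: the function name `hash_encode`, the use of CRC32, and the default dimension are all assumptions chosen for the example.

```python
import zlib


def hash_encode(tokens, dimension=16):
    """Project a bag of tokens onto a fixed-size count vector.

    Hypothetical helper for illustration; it does not reproduce
    Mahout's encoders. CRC32 is used only because it is deterministic.
    """
    vector = [0.0] * dimension
    for token in tokens:
        # Map each token to a bucket; distinct tokens may collide.
        index = zlib.crc32(token.encode("utf-8")) % dimension
        vector[index] += 1.0
    return vector


encoded = hash_encode(["naive", "bayes", "naive"], dimension=16)
```

The vector length depends only on `dimension`, not on the vocabulary size, which is why a smaller projection dimension bounds the model size.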

Re: Memory Requirements of Naïve Bayes?

2013-01-04 Thread Robin Anil
Use seq2encoded instead to create smaller vectors. See the other thread. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Thu, Jan 3, 2013 at 3:47 PM, Robin Anil wrote: > Model is bounded by the feature space. So if you are using uptil trigrams, > you need to estima

Re: How to use Naive Bayes Classifier to classify new data?

2013-04-16 Thread Robin Anil
consistent vector. 3. Once thats done, run seq2encoded on a directory of text documents that are not seen (which includes your message.txt among others). and run testnb on it. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Mon, Apr 15, 2013 at 7:09 PM, Brian Feeny wrote

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-16 Thread Robin Anil
Sounds like a config issue. The MR version should be able to parallelize based on the size of the input.

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-17 Thread Robin Anil
You won't; it's a tiny amount of data. Mappers are determined by the split size and input shards. Either shard the input into more than 10 pieces or reduce the map split size. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Wed, Apr 17, 2013 at 3:32 PM, Ryan Compton wrote: > Any ideas where
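
A rough way to see why a small input yields few mappers is to count splits per input file. The sketch below is a simplified model, assuming one split per `split_size` bytes of each file; real Hadoop input formats apply extra rules (block boundaries, slack on the last split), and `estimate_map_tasks` is a hypothetical helper, not a Hadoop API.

```python
import math

MB = 1024 * 1024


def estimate_map_tasks(file_sizes, split_size):
    """Estimate map tasks as the total number of splits across files.

    Simplified assumption: each non-empty file contributes
    ceil(size / split_size) splits, and splits never span files.
    """
    return sum(math.ceil(size / split_size) for size in file_sizes if size > 0)


one_file_default = estimate_map_tasks([50 * MB], 64 * MB)   # single shard, large split
one_file_small = estimate_map_tasks([50 * MB], 5 * MB)      # same data, smaller split
ten_shards = estimate_map_tasks([5 * MB] * 10, 64 * MB)     # same data, more shards
```

Under this model, either lowering the split size or writing the input as more shards raises the number of map tasks, which matches the two options given in the reply.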

Re: DenseRowMatrix?

2013-04-17 Thread Robin Anil
SparseRowMatrix? On Apr 17, 2013 5:26 PM, "Gokhan Capan" wrote: > Hi, > > Using Mahout Matrix interface I want to represent some data where the row > vector is dense iff an instance is associated to this row index, empty > otherwise. The max possible index for rows (a.k.a. rowSize) is defined. >

Re: DenseRowMatrix?

2013-04-17 Thread Robin Anil
Make one? On Apr 17, 2013 5:37 PM, "Gokhan Capan" wrote: > Robin, > > Aren't SparseRowMatrix rows are sparse vectors? In my use case row vectors > don't need to be sparse, they are either full or empty. > > > On Thu, Apr 18, 2013 at 1:32 AM, Robin Anil

Re: DenseRowMatrix?

2013-04-17 Thread Robin Anil
Yes! Yes! Go for it!. On Apr 17, 2013 5:52 PM, "Gokhan Capan" wrote: > I didn't quite get that, and assuming you tell me to implement it > > Thanks > > > On Thu, Apr 18, 2013 at 1:44 AM, Robin Anil wrote: > > > Make one? > > On Apr 17, 2013

Re: What's the difference between "trainnb" and "trainclassifier -type bayes"?

2013-04-18 Thread Robin Anil
rates vectors using hashing trick) See examples/bin/classify-20newsgroups.sh Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Thu, Apr 18, 2013 at 9:03 PM, Ryan Compton wrote: > When I use "trainclassifier" I am able to run the 20 news groups just > fine. I'm

Re: mahout colt collections

2013-05-21 Thread Robin Anil
I think you forgot to attach the test file On May 21, 2013 7:30 AM, "Sophie Sperner" wrote: > Link to hhpc jar file - > http://labs.carrotsearch.com/hppc-download.htmlthen press Download > button on the right. > > > On 21 May 2013 13:23, Sophie Sperner wrote: > > > Dear Dan, all, > > > > I do no

Re: df-count/data does not exist

2011-07-22 Thread Robin Anil
What does the folder input vectordata contain? I am guessing you gave the top level directory instead of giving the tfidf-vectors folder as input Robin On Fri, Jul 22, 2011 at 8:33 PM, Liliana Mamani Sanchez wrote: > Hello all, > > I was trying to run a basic canopy clustering command: > > > bin

Re: HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

2011-07-25 Thread Robin Anil
We dropped it after pruning the dependencies in Mahout. You can simply bring back the class(from the repository) and use it to connect to HBase in your client code. Robin On Mon, Jul 25, 2011 at 6:23 PM, NightWolf wrote: > Hi all, > > I'm working on a large text classification project and we ha

Re: HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

2011-07-25 Thread Robin Anil
hese > documents directly from and to HBase rather than using HDFS? > > Thanks, > NW > > On Mon, Jul 25, 2011 at 11:03 PM, Robin Anil wrote: > > > We dropped it after pruning the dependencies in Mahout. You can simply > > bring > > back the class(from the repos

Re: Parallel FPGrowth driver - doc problem?

2011-07-27 Thread Robin Anil
It's outdated. This page predates the bin/mahout fpg launcher, so some of the sections use the mvn exec plugin directly. On Tue, Jul 26, 2011 at 10:11 PM, Lance Norskog wrote: > The FPGrowth driver page: > > https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Frequent+Pattern+Mining > >

Re: Parallel FPGrowth driver - what is a good demo?

2011-07-27 Thread Robin Anil
On Tue, Jul 26, 2011 at 11:06 PM, Lance Norskog wrote: > The parameters and files mentioned on this page do not find any > frequent patterns: > > https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Frequent+Pattern+Mining Let me run and correct this doc. > > > Have 'accidents.dat.gz' fr

Re: Kmeans runs successfully, but no map/reduce jobs

2011-07-27 Thread Robin Anil
Have you verified that the Sequence file in the input folder is having valid records? Robin On Wed, Jul 27, 2011 at 4:20 PM, Dave Gettier wrote: > > I am running a kmeans application which was adapted from example 7.2 of > Mahout in Action. The java program runs successfully, giving me the >

Re: #mahout IRC

2011-08-30 Thread Robin Anil
I used to be the only one hanging around there 4 years ago during Gsoc. On Tue, Aug 30, 2011 at 6:47 PM, Dhruv Kumar wrote: > Interesting indeed! > > On Tue, Aug 30, 2011 at 9:06 AM, Ted Dunning > wrote: > > > Interesting. I didn't know about the IRC channel. The hbase group uses > > IRC > >

Re: bug when generating sparse vector

2011-09-05 Thread Robin Anil
use the sequence file dumper to inspect the files bin/mahout seqdumper --help On Tue, Sep 6, 2011 at 10:03 AM, Walter Chang wrote: > i ended up add a default SmartChineseAnalyzer constructor to get around > with > the issue. I have another question. Right now, I can see the following > directori

Re: Default value of numGroups in FPGrowthJob

2011-09-06 Thread Robin Anil
you can do --help on the command to see the current default value for any flag On Tue, Sep 6, 2011 at 2:43 PM, yuji anzai wrote: > Hi, > > I tried FPGrowthJob using mahout command below. > 1. $MAHOUT_HOME/bin/mahout fpg -i /input/retail.dat -o /output/numgdef/ > -method mapreduce -regex '[\s]' -

Re: bug when generating sparse vector

2011-09-06 Thread Robin Anil
Yes. See the default values for minSupport using --help on seq2sparse command On Tue, Sep 6, 2011 at 12:47 PM, Walter Chang wrote: > works. After dumping the content, it seems the tokenized > document seems to be correct. However, the word count doesn't contain > the > term that has only 1 oc

Re: Ehcache and Mahout

2011-09-10 Thread Robin Anil
I once wrote a simple cache for HBaseDatastore in naive Bayes classifier package and yes the speedup was really awesome, weights of high freq words got cached and incremental lookup for rest of the words in a document was really low. I had posted numbers on the old JIRA ticket On Sep 11, 2011 12:3

Re: (C)NB classifier scores

2011-09-15 Thread Robin Anil
Smaller is better(negative number so largest of the negative number in absolute value), this is to say if you have the lowest affinity to the complement class, you have highest affinity to the actual class which the data belongs to. Unless the new computation is spitting out positive numbers in whi

Re: (C)NB classifier scores

2011-09-15 Thread Robin Anil
yes On Thu, Sep 15, 2011 at 10:39 PM, Grant Ingersoll wrote: > Sorry for my poor wording. > > Just to confirm: > for CNB, smaller is better? > for NB, larger is better? > > On Sep 15, 2011, at 12:23 PM, Robin Anil wrote: > > > Smaller is better(negative number so

Re: 92% accuracy on Weka NaiveBayesMultinomial vs 66% with Mahout bayes

2011-09-16 Thread Robin Anil
Did you try complementary naive bayes(CNB). I am guessing the multinomial naivebayes mentioned here is a CNB like implementation and not NB. On Fri, Sep 16, 2011 at 5:30 PM, Benjamin Rey wrote: > Hello, > > I'm giving a try to different classifiers for a classical problem of text > classificatio

Re: PfgGrowth job got stuck when run into fpGrowth.generateTopKFrequentPatterns

2011-09-22 Thread Robin Anil
There was a similar bug which was fixed in the trunk version. Let me dig that Jira ticket up. On Thu, Sep 22, 2011 at 7:37 PM, bing wang wrote: > PfpGrowth is also ok with the retails an daccidents dataset( > http://fimi.ua.ac.be/data/) when runing over my cluster. My cluster has > 1500+ nodes.

Re: PfgGrowth job got stuck when run into fpGrowth.generateTopKFrequentPatterns

2011-09-22 Thread Robin Anil
https://issues.apache.org/jira/browse/MAHOUT-629 Try with this patch. On Thu, Sep 22, 2011 at 8:53 PM, Robin Anil wrote: > There was a similar bug which was fixed in the trunk version. Let me dig > that Jira ticket up. > > > > On Thu, Sep 22, 2011 at 7:37 PM, bing wang wro

Re: Bayes/CBayes classification on a non-existing feature

2011-09-29 Thread Robin Anil
Looks like a bug. I am interested to see the differences in quality for the 20newsgroups example 2011/9/29 André-Philippe Paquet > causing

Re: Diagnosing naive bayes results

2012-01-28 Thread Robin Anil
If the score is 0, then its category is assumed to be the default. If there is a score, then naive Bayes takes the category with the largest score, and CNB takes the category with the lowest score. -- Robin Anil On Sat, Jan 28, 2012 at 5:32 PM, Stuart Smith wrote: > > > Any idea if there is a default
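
The selection rule described here can be sketched in a few lines. This is an illustration under stated assumptions, not Mahout's code: `pick_category` is a hypothetical helper, scores are assumed to arrive as a label-to-score mapping, and "score is 0" is interpreted as "no label has a non-zero score".

```python
def pick_category(scores, complementary=False, default="unknown"):
    """Choose a label from per-category scores.

    Assumptions for illustration: naive Bayes picks the largest score,
    complementary naive Bayes picks the lowest, and the default label
    is returned when no category has a non-zero score.
    """
    nonzero = {label: score for label, score in scores.items() if score != 0}
    if not nonzero:
        return default
    chooser = min if complementary else max
    return chooser(nonzero, key=nonzero.get)


nb_label = pick_category({"sports": -3.0, "politics": -7.5})
cnb_label = pick_category({"sports": -3.0, "politics": -7.5}, complementary=True)
```

With the same scores, the two modes pick opposite ends of the ranking, so comparing raw scores between NB and CNB runs only makes sense once the direction is taken into account.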

Re: Filter out small docs

2012-03-31 Thread Robin Anil
I would suggest adding a preprocessing step to generate the input sequence file that Mahout reads, instead of relying on the seqdirectory tool. Most of your tuning time will go into tweaking your processed documents. -- Robin Anil On Sat, Mar 31, 2012 at 12:02 PM, Pat

Re: Naive Bayes training filling up jobcache

2012-04-03 Thread Robin Anil
which version are you using? bayes.* or naivebayes.* -- Robin Anil On Tue, Apr 3, 2012 at 2:26 PM, Stuart Smith wrote: > Hello all, > > I've got Naive Bayes working pretty good. Now I want to train a much > bigger model. From about 100,000 samples in each category to

Re: Naive Bayes training filling up jobcache

2012-04-03 Thread Robin Anil
control how much space gets used a little better... > > Take care, >-stu > >-- > *From:* Robin Anil > *To:* user@mahout.apache.org; Stuart Smith > *Cc:* Mahout List > *Sent:* Tuesday, April 3, 2012 1:00 PM > *Subject:* Re: Naiv

Re: CBayes Input

2012-04-12 Thread Robin Anil
In the command line example replace "bayes" with "cbayes". That's all you need to do. On Apr 12, 2012 7:29 AM, "Lithium Guava" wrote: > Hi, > > I've played with the bayes 20newsgroups example, but I'd like to try > running the cbayes algorithm on it also. The example script doesn't seem to > off

Re: Classification: using the Java API always returns the same category

2012-04-12 Thread Robin Anil
Can you print the logs when you run your code. -- Robin Anil On Thu, Apr 12, 2012 at 11:25 AM, Verachten Bruno wrote: > Hi, > > I use mahout 0.5 with hadoop 1.0.1. > I have a model for four categories that I got with: > mahout trainclassifier -i train -o model -type cbayes

Re: Classification: using the Java API always returns the same category

2012-04-13 Thread Robin Anil
This shows that category3 is being selected for your input string. I don't see any apparent problems. Can you try to run over the training data and see if the model is predicting correctly in your API version, just as a sanity check? Again, send logs of the run. -- Robin Anil 2012/4/13 Verachten

Re: A Mahout Naive Bayes classifier problem

2012-05-04 Thread Robin Anil
Can you provide the console output when you run train or test On May 4, 2012 8:09 AM, "Zehao Jin" wrote: > ** > Dear all, > I'm a mahout beginner, I need to use the mahout Naive Bayes classifier for > text classification.To get started, I followed the example of Twenty > NewsGroup: > 1.Start the

Re: Java Heap Error: ItemSimilarityJob

2012-06-06 Thread Robin Anil
This should be baked in by default. I don't think people use less than 4g these days. On Jun 6, 2012 12:24 PM, "Vinod Singh" wrote: > Child heap size can be increased by passing command line options as well. > See the example given below- > > -Dmapred.map.child.java.opts=-Xmx6100m > -Dmapred.reduc

Re: Problem using SNAPSHOT kmeans

2012-06-06 Thread Robin Anil
yes -- Robin Anil On Wed, Jun 6, 2012 at 4:48 PM, Jeff Eastman wrote: > ... and the problem presents when both dotProduct and denominator are > zero. It seems unreasonable for k-means to fail to cluster zero vectors in > this case. Seems like in this case the distance ought to return 1. >

Re: n-gram and ml

2012-06-09 Thread Robin Anil
-- Robin Anil On Sat, Jun 9, 2012 at 10:27 AM, Pat Ferrel wrote: > As I understand it when using seq2sparse with ng = 2 and ml = some large > number. This will never create a vector with less terms than words (all > other pars of the algorithm set aside). In other words ng =

Re: n-gram and ml

2012-06-09 Thread Robin Anil
de vectors if they are empty in the encoder job. For you, things might just work ok now as the original distance measure bug is fixed. Robin ------ Robin Anil On Sat, Jun 9, 2012 at 5:03 PM, Pat Ferrel wrote: > OK, thanks. I'm trying to find ways to reduce dimensionality in some >

Re: n-gram and ml

2012-06-09 Thread Robin Anil
https://issues.apache.org/jira/browse/MAHOUT-1031 Well use this for now. Wiring it to be a flag is too much wiring work. There should be a better way to check if vector is empty, so I am not going to submit this during code freeze. -- Robin Anil On Sat, Jun 9, 2012 at 6:08 PM, Pat Ferrel

Re: threshold for the complementary naive bayes

2012-06-09 Thread Robin Anil
What kind of threshold? -- Robin Anil On Sat, Jun 9, 2012 at 2:46 PM, Anatoli Matuskova < anatoli.matusk...@gmail.com> wrote: > Hey there, > How could a threshold be established for the classifier using complementary > naive bayes? > > -- > View this message in

Fwd: Apache Mahout 0.7 Released

2012-06-19 Thread Robin Anil
-- Forwarded message -- From: "Paritosh Ranjan" Date: Jun 16, 2012 5:45 AM Subject: Apache Mahout 0.7 Released To: "d...@mahout.apache.org" Apache Mahout has reached version 0.7. All developers are encouraged to begin using version 0.7. Highlights include: -Outlier removal capab

Re: Mahout bayes classifier parameters

2012-07-02 Thread Robin Anil
Also have you looked at the new naive bayes package. It is super fast. On Jul 2, 2012 8:21 AM, "Ted Dunning" wrote: > Did you read the original Naive Bayes paper? > > On Mon, Jul 2, 2012 at 12:47 AM, damodar shetyo >wrote: > > > While using Bayes classier in Mahout we set parameters as follows:

Re: poor classifier results

2012-07-07 Thread Robin Anil
Can you list the command line you used? On Jul 7, 2012 3:48 PM, "Alexander Aristov" wrote: > People, > > I am implementing Naive Bayes classifier on my text data and get poor > results. > > Self-Testing on trained data gives 95% pos and 5% neg results (not bad). > But testing on hold out set gives 6

Re: poor classifier results

2012-07-08 Thread Robin Anil
Try using encodedvectorsfromsequencefile On Jul 8, 2012 2:04 AM, "Alexander Aristov" wrote: > So what numbers shall I think about? 100,1000 training files per category? > > When you was writingL1 regularized logistic regression did you mean SGD > algorithm? Can I take it from example? > > thanks

Re: Naive Bayes classification questions

2012-07-21 Thread Robin Anil
-- Robin Anil On Fri, Jul 20, 2012 at 3:27 PM, David Engel wrote: > Hi, > > I have a couple of questions regarding Naive Bayes classification in > Mahout 0.7. > > Is there a preferred way to determine when a document doesn't belong > to any of the given categori

Re: EMC Israel Data Science Challenge- classify open source to the project

2012-07-25 Thread Robin Anil
This look like a job for Mahout CNB :) -- Robin Anil On Wed, Jul 25, 2012 at 10:42 AM, Daniel Glauser wrote: > As a former VMware employee (VMware is essentially owned by EMC) I know > they go through a laborious legal process to vet open source libraries to > make sure they won&#

Re: non-text NB classifiers?

2012-07-31 Thread Robin Anil
You can pass in any vector (not just a tfidf vector). For example, the asf-email example script uses vectors generated using the randomized encoding. -- Robin Anil On Tue, Jul 31, 2012 at 12:26 PM, Sean Owen wrote: > I don't know this code too much, but, there is simply a step in f

Re: non-text NB classifiers?

2012-07-31 Thread Robin Anil
its EncodedVectorsFromSequenceFiles.java I believe -- Robin Anil On Tue, Jul 31, 2012 at 6:05 PM, Eric Friedman wrote: > Can you point me to the class I should look at to see how this is done? > > On Tue, Jul 31, 2012 at 10:49 AM, Robin Anil wrote: > > You can pass in any ve

Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

2012-08-01 Thread Robin Anil
Tfidf job is where the document frequency pruning is applied. Try increasing maxDFPercent to 100 % On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel wrote: > Hello! > > I have trouble running the example "seq2sparse" with TFIDF weights. My TF > vectors are Ok, while TFIDF vectors are 10 times smalle

Re: Several questions about Mahout

2010-05-11 Thread Robin Anil
Hi Guillaume On Tue, May 11, 2010 at 6:32 PM, Guillaume Billard wrote: > Hello, > > My company is looking into creating a website for clothes shopping built > around a recommendation engine. User criteria would be past purchases and > items that have been looked at (à la Amazon), measurements, per

Re: Who owns mahout bucket on s3?

2010-05-22 Thread Robin Anil
Does anyone want to mirror these? I am clearing out my account. If no one takes them, I will copy them over to my home dir @ people.apache.org http://mahout-wikipedia.s3.amazonaws.com/wikipedia -jan-2010-seqfile-deflate-chunk-[0-5]

Re: Mahout LDA Parameter: maxIter

2010-05-22 Thread Robin Anil
David's rule of thumb was to let the iterations go until relative change in LL becomes around 10^-4 Robin On Sat, May 22, 2010 at 9:12 PM, Jeff Eastman wrote: > I suggest you try running with a trunk checkout and upgrading to Hadoop > 0.20.2. Mahout is still in motion and I've run LDA on Reuters
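
The rule of thumb can be written as a small stopping check. This is a minimal sketch, assuming the log-likelihood is available after each iteration; `has_converged` and its `tol` parameter are hypothetical names and not part of the Mahout LDA driver.

```python
def has_converged(previous_ll, current_ll, tol=1e-4):
    """Return True when the relative change in log-likelihood is below tol.

    Hypothetical helper for illustration. The relative change is
    |current - previous| / |previous|.
    """
    if previous_ll == 0:
        # Avoid dividing by zero; treat as converged only if nothing changed.
        return current_ll == 0
    return abs(current_ll - previous_ll) / abs(previous_ll) < tol


still_moving = has_converged(-1000.0, -990.0)    # relative change of 1e-2
settled = has_converged(-1000.0, -999.95)        # relative change of 5e-5
```

In practice a check like this would sit alongside a hard cap on the iteration count, so a run still terminates if the likelihood oscillates.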

Re: Mahout LDA Parameter: maxIter

2010-05-23 Thread Robin Anil
dataset is also the same with mahout 0.3 (on > which the experiment works ok except for *only one map* in each > iteration~). > > Is it because of absence of some other patches? Or is there any other > mistakes in my operations? > > Thank you! > > > On Sun, May 23, 20

Re: M/R Job for Log file to FPG

2010-05-27 Thread Robin Anil
fpg uses regex to split. Just add another option for using the regex to match instead of splitting. Less work I guess On Fri, May 28, 2010 at 2:42 AM, Grant Ingersoll wrote: > I'd like to take a bunch of logs and extract a bit of each line and then put > them into format for FPG.  Was thinking

Re: M/R Job for Log file to FPG

2010-05-28 Thread Robin Anil
On Fri, May 28, 2010 at 7:39 PM, Grant Ingersoll wrote: > Robin, > > What I'll do here is make the code reusable so that we can use it in FPG > directly as well. > Cool. Btw there is one more thing missing. Make sure each item in an itemset to the algorithm is formed of unique tokens. I dont thi

Re: M/R Job for Log file to FPG

2010-05-28 Thread Robin Anil
oops that should be 4 instead of 3.

Re: --input now -Dmapred.input.dir ?

2010-05-28 Thread Robin Anil
--input could be misleading, if we dont specify what format the input file is in. Like DictionaryVectorizer needs Text,Text . Kmeans need Text, VW --input is ok if we can create a input tester which tests and throws error if the files are not in the required format. Cheaper than launching a map/r

Re: Mahout classifier error

2010-06-07 Thread Robin Anil
You need to set a lot more things in the BayesParameters See the TrainClassifier main function and set everything it does by default Robin On Mon, Jun 7, 2010 at 8:20 PM, JAGANADH G wrote: > Dear All > I was trying the Mahout classifier . > The program which I used for classifcation is given

Re: TU Berlin Summer of Code by Isabel Drost: Example Data Set for HMM

2010-06-12 Thread Robin Anil
There are POS tagging datasets for English; they are good tests for HMM-based sequence classification. On Jun 12, 2010 11:47 PM, wrote: > Hello everybody, > > currently we are implementing HMM for Mahout and now are looking for > some example > data set. Could you guys recommend a data set to us?

Re: TU Berlin Summer of Code by Isabel Drost: Example Data Set for HMM

2010-06-12 Thread Robin Anil
http://flexcrfs.sourceforge.net/#Document_&_Source_Code The above framework is for conditional random fields. You should be getting around 95% precision using hmms sent from nexus one On Jun 12, 2010 11:53 PM, "Marc Hofer" wrote: I will take a look at it, how can I access the data set? Cheers,

Re: TU Berlin Summer of Code by Isabel Drost: Example Data Set for HMM

2010-06-12 Thread Robin Anil
There is a NP chunking dataset on that webpage you can try that instead and compare sent from nexus one On Jun 12, 2010 11:58 PM, "Robin Anil" wrote: http://flexcrfs.sourceforge.net/#Document_&_Source_Code The above framework is for conditional random fields. You should be get

Re: PFPGrowth on cluster does not distribute work load equally on nodes

2010-06-17 Thread Robin Anil
Hi Bjorn, The data is distributed in a skewed manner. That's a problem with the algorithm as proposed in the paper. The way around it is to increase the number-of-groups parameter. For example, if you have 10K unique features, try to split them into groups such that there are around 10 features p
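
The sizing advice amounts to simple arithmetic. The sketch below assumes the truncated sentence means roughly 10 features in each group; `suggest_num_groups` and `features_per_group` are hypothetical names for illustration, not PFPGrowth flags.

```python
import math


def suggest_num_groups(unique_features, features_per_group=10):
    """Suggest a group count so each group holds about features_per_group items.

    Hypothetical helper; the target of about 10 features per group is an
    assumption read from the (truncated) message above.
    """
    if unique_features <= 0:
        return 1
    return math.ceil(unique_features / features_per_group)


groups_for_10k = suggest_num_groups(10_000)
```

More groups means each reducer handles a smaller slice of the feature space, which is the mechanism the reply relies on to even out the skew.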

Re: PFPGrowth on cluster does not distribute work load equally on nodes

2010-06-23 Thread Robin Anil
PFPGrowth job does not set the number of maps or reduces. The reason for maps being 1 could be due to the data being smaller. See, the Transaction sorting job converts data into integers and you might see a drastic reduction in dataset size(see output/sortedoutput to check the compressed data size)

Re: Rule-based classifier

2010-06-23 Thread Robin Anil
CBayes should work fine with any length text, give it a try. the training code runs without a hadoop cluster using the hadoop lib and classification can be done independent of hadoop Robin On Thu, Jun 24, 2010 at 11:43 AM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > hat finds comme

Re: Rule-based classifier

2010-06-23 Thread Robin Anil
Be sure to set ngrams to at least 2 to select blocks. The more the merrier for short text. Robin On Thu, Jun 24, 2010 at 11:46 AM, Robin Anil wrote: > CBayes should work fine with any length text, give it a try. the training > code runs without a hadoop cluster using the hadoop l

Re: [OT] Mahout expertise

2010-07-07 Thread Robin Anil
@Ankur ? On Wed, Jul 7, 2010 at 9:52 PM, Ted Dunning wrote: > Pity. I am in the San Francisco Bay area. Would love to help. > > Robin Anil is in India, but I think he is totally over-committed. > > On Wed, Jul 7, 2010 at 9:17 AM, tog wrote: > > > Hi, > > >

Re: Help with running clusterdump after running Dirichlet

2010-07-16 Thread Robin Anil
I am trying to run clusterdumper from trunk. seems like its not outputting anything. Need to investigate bin/mahout clusterdump -s reuters-clusters/cluster-6/part-r-0 -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10 -b 100 On Fri, Jul 16, 2010 at 7:08 AM, Jeff Eastman wrote: > Als

Re: Help with running clusterdump after running Dirichlet

2010-07-16 Thread Robin Anil
find one. This could obviously be improved; at least > an error message would be appropriate. I see it does not extend AbstractJob > either. I'll look into that next week. > > Jeff > > > > On 7/16/10 12:24 AM, Robin Anil wrote: > >> I am trying to run cluste

Cloudera HUE Opensourced

2010-07-19 Thread Robin Anil
http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-hue/

Re: Cloudera HUE Opensourced

2010-07-20 Thread Robin Anil
+100 sent from nexus one On Jul 19, 2010 9:19 PM, "Ted Dunning" wrote: > That would be great! > > On Mon, Jul 19, 2010 at 7:38 PM, Josh Patterson wrote: > >> From just a personal >> time perspective, I may try and mock up some demos for something like >> this. >>

Re: Cloudera HUE Opensourced

2010-07-21 Thread Robin Anil
On Mon, Jul 19, 2010 at 7:38 PM, Josh Patterson wrote: > (disclaimer: I work for Cloudera) > > As someone who loves the WEKA suite and UI stuff, I can't help but > think Hue makes an interesting choice of a platform to build something > similar for Mahout on, being open source and all. From just

Re: about the sourcecode of PFPGrowth

2010-07-28 Thread Robin Anil
This is not regular FPGrowth. This has many other super improvements ;) 1) Its a faster method for mining of Top K patterns for each unique item: How does it do it? First it makes the conditional tree for each feature in Bottom up manner(like the paper). Then it mines the conditional tree in top d

Re: about the sourcecode of PFPGrowth

2010-07-28 Thread Robin Anil
> But, i still have some questions. > > On Thu, Jul 29, 2010 at 6:37 AM, Robin Anil wrote: > > > This is not regular FPGrowth. This has many other super improvements ;) > > > > 1) Its a faster method for mining of Top K patterns for each unique item: > > How doe

Re: Clustering Questions

2010-08-16 Thread Robin Anil
Seems to me like a lack of memory error. Try increasing the heap size. Hadoop is throwing "out of mem" exception, which doesnt get propagated to the driver Robin On Tue, Aug 17, 2010 at 2:52 AM, Drew Farris wrote: > On Mon, Aug 16, 2010 at 2:15 PM, Severance, Steve > wrote: > > > 1. It a

Re: machine learning with Mahout

2010-08-18 Thread Robin Anil
On Wed, Aug 18, 2010 at 8:31 AM, Srivathsan Srinivas < srivathsan.srini...@gmail.com> wrote: > Hi Robin, > I am in the process of learning to use Mahout for machine learning > and want to compare 2 different classification algorithms on the same set of > data (say TwentyNewsGroups). Currentl

Re: Custom Input Format in New API (Convert Mahaout XMLInput Format to New API)

2010-08-24 Thread Robin Anil
+mahout-user On Tue, Aug 24, 2010 at 5:37 PM, Shuja Rehman wrote: > Hi > I am trying to convert Mahout xmlInputFormat to new API but this is not > working. The problem which i think is that in old api we have next method > which takes key and value and we can set it in the method > > public

Re: XmlInputFormat.java using new Hadoop APIs

2010-09-03 Thread Robin Anil
Not yet, take a stab! On Fri, Sep 3, 2010 at 9:12 PM, Hussain, Kamal (Kamal) < kamal.huss...@alcatel-lucent.com> wrote: > Hi there, > I was wondering if there is a version of XmlInputFormat.java which uses the > newer Hadoop APIs. The one I found here< > http://github.com/apache/mahout/blob/ad84

Re: XmlInputFormat.java using new Hadoop APIs

2010-09-03 Thread Robin Anil
Please send a patch changing the current XMLInputFormat code, see the wiki page on the style guidelines. We are in the process of changing all code to Hadoop 0.20 API On Fri, Sep 3, 2010 at 9:31 PM, Shuja Rehman wrote: > Hi > > Check out this link > > http://xmlandhadoop.blogspot.com/ > > On Fr

Re: Getting error in Training the classifier as in TwentyNewsgroup

2010-09-24 Thread Robin Anil
also use the mahout shell script bin/mahout trainclassifier bin/mahout test classifier using the same parameters. Robin -- Forwarded message -- From: Sean Owen Date: Fri, Sep 24, 2010 at 11:46 PM Subject: Re: Getting error in Training the classifier as in TwentyNewsgroup To: u

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Robin Anil
It does that by default for all words. What else do you have in mind? On Thu, Sep 30, 2010 at 8:07 PM, Neil Ghosh wrote: > Does anybody have examples/reference how to use TF-IDF weights in mahout > cbayes for particular words and phrases while doing text classification ? > > -- > Thanks and Rega

Re: unknown test data twenty-newsgroups example

2010-09-30 Thread Robin Anil
You may split the dataset in 80/20 or some other ratio and try. You can split them after you have created the data in Bayes classifier format or split it into different folders and make them as described in the documentation. Robin On Thu, Sep 30, 2010 at 7:30 PM, Neil Ghosh wrote: > Hi, > > I
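
An 80/20 split of a document list can be sketched as below. This is a generic illustration, not a Mahout tool: `split_train_test`, the fixed seed, and the idea of splitting a list of file names before building the classifier input are assumptions for the example.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=42):
    """Shuffle items reproducibly and split them into train and test lists.

    Hypothetical helper for illustration; items could be file names
    that are later copied into separate train and test folders.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_docs, test_docs = split_train_test([f"doc{i}.txt" for i in range(100)])
```

Splitting per category rather than over the whole corpus keeps the label proportions the same in both sets, which matters when categories are unbalanced.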

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Robin Anil
et the output. If you need to add couple words hardcoded into the classifier. Add them as a training instance. Since features are assumed to be independent in bayes. it doesnt matter how you give them POSproblemcomplaintproblemo > > > On Thu, Sep 30, 2010 at 8:55 PM, Robin Anil wrote: &g

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Robin Anil
ing > > LABELproblemcomplaintproblemo > > Along with the usual training data in Bayes format ? > > > On Thu, Sep 30, 2010 at 9:44 PM, Robin Anil wrote: > >> >>>> Or Do I have flexibility to give some other input specific to my problem >>>> ? Such as if words like "Probl

Re: unknown test data twenty-newsgroups example

2010-09-30 Thread Robin Anil
On Thu, Sep 30, 2010 at 9:45 PM, Neil Ghosh wrote: > > Do you mean , I should 1st create the model with correct data in correct > folder (Label). > > Now you throw an instance at it and you will get the correct label, well most of the time.

Re: Getting error in Training the classifier as in TwentyNewsgroup

2010-09-30 Thread Robin Anil
Can I run without hadoop > using these scripts ? > > On Fri, Sep 24, 2010 at 11:59 PM, Robin Anil wrote: > > > also use the mahout shell script > > > > bin/mahout trainclassifier > > bin/mahout test classifier > > > > using the same parameter

Re: unknown test data twenty-newsgroups example

2010-10-01 Thread Robin Anil
> Let me list what I understood. Pl confirm if I got it correct? > > Add duplicate extra lines many times in an extra file (conforming to the > format required by the Bayes Classifier) in the format > > If I want to increase the weight of word1 and word2, so that text with > those words have high

Re: How to get multi-language support for training/classifying text into classes through Mahout?

2010-10-02 Thread Robin Anil
The classifier supports non-English tokens (it assumes the string is UTF-8 encoded). Robin On Sat, Oct 2, 2010 at 9:16 PM, Bhaskar Ghosh wrote: > Dear All, > > I have a requirement where I need to classify text in a non-English > language. I > have heard that Mahout supports multi-language. Can anyone p

Re: Mahout Classifier

2010-10-04 Thread Robin Anil
Could you see what the scores for the two classes are coming as? Robin On Mon, Oct 4, 2010 at 12:41 PM, JAGANADH G wrote: > Dear All > I tried the Mahout Naive Bayes Classifier with 1300 'Good' document and > 1300 > 'Bad' documents > The code which I used for training is given below. > > publ

Re: Querry regarding use of classifier in Mahout

2010-10-18 Thread Robin Anil
Let me take a look. I will let you know. How was the preprocessing done? Could you enumerate the steps you followed. On Mon, Oct 18, 2010 at 8:37 PM, JAGANADH G wrote: > Dear All > I am trying to implement classifier algo used in Mahout for a sample > project. > > I tried both NaiveBayesClassife

Re: Querry regarding use of classifier in Mahout

2010-10-18 Thread Robin Anil
Correctly Classified Instances   : 1702   85.1%
Incorrectly Classified Instances :  298   14.9%
Total Classified Instances       : 2000
=== Confusion Matrix

Re: Querry regarding use of classifier in Mahout

2010-10-18 Thread Robin Anil
bin/mahout prepare20newsgroups -p /Users/robinanil/Downloads/movie_reviews/ -o movie -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer
bin/mahout trainclassifier -i movie/ -o movie-model -type cbayes -a 1.0
bin/mahout testclassifier -d movie -m movie-model/ -type bayes -default unknown

Re: Querry regarding use of classifier in Mahout

2010-10-18 Thread Robin Anil
Just pushed a bug fix for ngrams. Update your copy. Here is the result with ngram = 2
Correctly Classified Instances   : 1995   99.75%
Incorrectly Classified Instances :    5    0.25%
Total Classified Instances       : 2000
==

Re: Querry regarding use of classifier in Mahout

2010-10-18 Thread Robin Anil
, 2010 at 10:14 AM, Robin Anil wrote: > > > Just pushed a bug fix for ngrams. Update your copy. Here is the result > with > > ngram = 2 > > > > Correctly Classified Instances : 1995 99.75% > > Incorrectly Classified Instances

Re: Vector in Mahout

2010-10-24 Thread Robin Anil
Vector(a map of int to doubles, in simple sense, can also be implemented as array or double arrays as well) is serialized into a binary format. See AbstractVector.java deep inside math directory to know how it is read and written as bytes. On Mon, Oct 25, 2010 at 11:17 AM, Divya wrote: > Hi, >

Re: Mahout dependencies on windows

2010-10-24 Thread Robin Anil
Hadoop is not supported on windows, and Mahout is written completely on top of Hadoop libraries. So we can't help you there. Maybe someone on this list may have experience with hacking Mahout to work on windows On Mon, Oct 25, 2010 at 11:34 AM, Divya wrote: > Hi, > > Is it must to install cygwi

Re: Reading Vectors

2010-10-26 Thread Robin Anil
VectorWritable has both reading and writing functions On Tue, Oct 26, 2010 at 12:29 PM, Lance Norskog wrote: > I found the Vector Writer utilities. Where are the matching readers? > > -- > Lance Norskog > goks...@gmail.com >

Fwd: running mahout with lucene.vector produces a dictionary output only

2010-10-28 Thread Robin Anil
-- Forwarded message -- From: Mackram Date: Thu, Oct 28, 2010 at 7:06 PM Subject: running mahout with lucene.vector produces a dictionary output only To: gene...@lucene.apache.org Hey everyone, I have a simple question to ask and hopefully someone can point me in the right dire
