Re: Spec for a common import/export service for Mahout jobs

2011-09-12 Thread Sean Owen
I think we discussed several of these points on the mailing list. I am not sure I would ever expect there to be a common format across all jobs. They just don't all operate on the same information. Even where two jobs ingest vectors, it doesn't mean vectors for one are meaningful for another. If

#clojure #fkmeans - Clustering of Test Data Failed

2011-09-12 Thread Jeffrey
Hi, I have test data with a number of points, written to a sequence file using a Clojure script as follows (I am equally bad at both Java and Clojure, but since I really don't like Java, I write my scripts in Clojure whenever possible).     #!./bin/clj     (ns sensei.sequence.core)

Re: #clojure #fkmeans - Clustering of Test Data Failed

2011-09-12 Thread Danny Bickson
Hi Jeffrey! I have encountered this problem as well. The workaround is to run one iteration of k-means to create an initial cluster assignment, and then run fuzzy k-means using the output of that first k-means iteration. Hope this helps, Danny Bickson On Mon, Sep 12, 2011 at 10:15 AM, Jeffrey
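The workaround Danny describes (seed fuzzy k-means with the output of a single k-means iteration) can be sketched outside Mahout as follows. This is a minimal pure-Python illustration, not Mahout's implementation: the function names, the toy points, the fuzziness value m=2.0 and the iteration count are all assumptions made for the example.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans_step(points, centers):
    """One hard k-means iteration: assign each point, then recompute the means."""
    groups = [[] for _ in centers]
    for p in points:
        groups[nearest(p, centers)].append(p)
    updated = []
    for group, old in zip(groups, centers):
        if group:
            updated.append(tuple(sum(axis) / len(group) for axis in zip(*group)))
        else:
            updated.append(old)  # keep an empty cluster's center unchanged
    return updated


def memberships(point, centers, m=2.0):
    """Fuzzy membership of point in each cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point sits exactly on a center: give it full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    inverse = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuzzy_kmeans(points, centers, m=2.0, iterations=20):
    """Plain fuzzy k-means (fuzzy c-means) starting from the given centers."""
    dims = len(points[0])
    for _ in range(iterations):
        u = [memberships(p, centers, m) for p in points]
        updated = []
        for j, old in enumerate(centers):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            if total == 0.0:
                updated.append(old)
                continue
            updated.append(tuple(
                sum(w * p[k] for w, p in zip(weights, points)) / total
                for k in range(dims)
            ))
        centers = updated
    return centers, [memberships(p, centers, m) for p in points]


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]

# Step 1: a single k-means iteration produces the initial clusters.
seeds = kmeans_step(points, [points[0], points[3]])

# Step 2: fuzzy k-means starts from those seeds instead of random ones.
centers, u = fuzzy_kmeans(points, seeds)
```

In Mahout itself the same idea would mean pointing the fuzzy k-means job at the cluster output directory of a one-iteration k-means run; the exact job options are not shown in this thread, so they are left out here.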

Re: #clojure #fkmeans - Clustering of Test Data Failed

2011-09-12 Thread Choon-Siang Jeffrey04 Lai
Hi Danny, I have read a small portion of the source code. For variation 1, an initial cluster will be generated using RandomSeedGenerator if none is found in the path, so I don't have to create the initial cluster myself. For variation 2, I actually have generated the initial cluster using

[Announcement] SearchWorkings.org is live!

2011-09-12 Thread Frank Scholten
Hi all, This is an announcement of the community site SearchWorkings.org [1]. SearchWorkings.org offers search professionals a point of contact or comprehensive resource to learn and discuss all the new developments in the world of open source search and related subjects like Mahout and Hadoop.

SGD/SVM classification : minimum dataset size for training

2011-09-12 Thread Loic Descotte
Hi all, My classification problem is very similar to the 20 newsgroups example, but I cannot use a large quantity of data for training. I'd like to know the minimum size of training data that SGD or SVM algorithms need to give reasonable results. My data

Re: SGD vs Naive Bayes for classification

2011-09-12 Thread Ted Dunning
Hard to say and certainly not without substantial amounts of testing. The guy who did it seems pretty solid, but it has never been tested by anybody for production use. On Mon, Sep 12, 2011 at 12:54 AM, Loic Descotte loic.desco...@kelkoo.com wrote: Mahout in Action is saying that SVM has been

Re: SGD vs Naive Bayes for classification

2011-09-12 Thread Zach Richardson
I haven't played with the one in Mahout. From what I understand they wrapped either Liblinear or Libsvm, so you should get results from that implementation comparable to using Libsvm from the command line or embedded in Rapidminer or Weka. On Mon, Sep 12, 2011 at 9:17 AM, Ted Dunning

Re: how to use ga.watchmaker

2011-09-12 Thread deneche abdelhakim
Mahout's GA is a utility class that allows a genetic algorithm written using Watchmaker to distribute the fitness computation. The examples are actually part of the Mahout distribution, so you can take a look at them. Please note that a good understanding of Watchmaker is required, but it's actually a

Re: SGD/SVM classification : minimum dataset size for training

2011-09-12 Thread Ted Dunning
SVM is reasonable. SGD with hand-tuning of the learning parameters may work. With so little training data, you will have a difficult time assessing whether your system is working. Sometimes, you can rephrase your problem so that all of your training data across many situations can be pooled

Build problems using Eclipse/MVN

2011-09-12 Thread Ramesh Nallapati
Hi, I am new to Mahout as well as to Maven. I downloaded Mahout from the svn repository and I am trying to install it on my Mac running the latest Lion OS. I read the instructions at https://cwiki.apache.org/confluence/display/MAHOUT/BuildingMahout and followed all the steps until

Re: SGD vs Naive Bayes for classification

2011-09-12 Thread Ted Dunning
They actually ported the liblinear algorithm so you should get comparable results unless there are bugs. Early tests looked good, but those are just that. On Mon, Sep 12, 2011 at 2:32 PM, Zach Richardson z...@raveldata.com wrote: I haven't played with the one in Mahout. From what I understand

Re: Spec for a common import/export service for Mahout jobs

2011-09-12 Thread Lance Norskog
I am not sure I would ever expect there to be a common format across all jobs. They just don't all operate on the same information. Even where two jobs ingest vectors, it doesn't mean vectors for one are meaningful for another. Machine learning has quite a few algorithms where data is