Re: New User to Mahout

2011-11-18 Thread Ioan Eugen Stan
On 12.11.2011 15:52, Frank Scholten wrote: Hi Sachin, Most Mahout jobs have several overloaded run methods. For example: KMeansDriver.run(configuration, input, clustersIn, output, measure, convergenceDelta, maxIterations, runClustering, runSequential) Also, most of them extend AbstractJob

Re: Problem compiling mahout

2011-11-18 Thread Sean Owen
Oh I think you're right, looking at the Hadoop source. It is using chmod, not a Java API. Yes I understand how chmod and cygwin work. cygwin intercepts all path resolution logic in the OS and rewrites it. It may well be an issue with finding the chmod command in the end, but, I think it ought to wo

lambda overfitting param and ParallelALSFactorizationJob -- suggested value?

2011-11-18 Thread Sean Owen
Sebastian do you have any thoughts on the right starting value for lambda, the overfitting param in your ALS-based implementation? Yes I'm looking at the same Koren paper you had mentioned. I don't have a good sense of whether the loss from that extra term is supposed to be "much more important",

Re: lambda overfitting param and ParallelALSFactorizationJob -- suggested value?

2011-11-18 Thread Sebastian Schelter
The right value for lambda depends on the data and the confidence function and should be chosen via cross-validation. Coincidentally, I'm currently watching lecture X (10) of http://ml-class.org which exactly talks about ways to choose the regularization parameter :) --sebastian On 18.11.2011 1
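A rough illustration of choosing a regularization parameter by cross-validation, as suggested above. This is only a sketch under loud assumptions: it uses a one-parameter ridge regression as a stand-in, not ALS and not Mahout's ParallelALSFactorizationJob, and the class name, method names, fold assignment, and sample data are all hypothetical.

```java
import java.util.stream.IntStream;

// Hypothetical stand-in: 1-D ridge regression, NOT the ALS factorization job.
public class LambdaCrossValidation {

    // Closed-form ridge fit for the model y ~ w * x:
    // w = sum(x*y) / (sum(x*x) + lambda), using only the rows in idx.
    static double fit(double[] x, double[] y, int[] idx, double lambda) {
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i : idx) {
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
        }
        return sxy / (sxx + lambda);
    }

    // Mean squared error of the weight w on the rows in idx.
    static double mse(double[] x, double[] y, int[] idx, double w) {
        double sum = 0.0;
        for (int i : idx) {
            double err = y[i] - w * x[i];
            sum += err * err;
        }
        return sum / idx.length;
    }

    // Average held-out error over the folds; row i belongs to fold (i % folds).
    static double cvError(double[] x, double[] y, int folds, double lambda) {
        double total = 0.0;
        for (int f = 0; f < folds; f++) {
            final int fold = f;
            int[] train = IntStream.range(0, x.length).filter(i -> i % folds != fold).toArray();
            int[] held = IntStream.range(0, x.length).filter(i -> i % folds == fold).toArray();
            total += mse(x, y, held, fit(x, y, train, lambda));
        }
        return total / folds;
    }

    // Return the candidate lambda with the lowest cross-validated error.
    static double chooseLambda(double[] x, double[] y, int folds, double[] candidates) {
        double best = candidates[0];
        double bestErr = Double.POSITIVE_INFINITY;
        for (double lambda : candidates) {
            double err = cvError(x, y, folds, lambda);
            if (err < bestErr) {
                bestErr = err;
                best = lambda;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up, slightly noisy data purely for illustration.
        double[] x = {1, 2, 3, 4, 5, 6};
        double[] y = {2.3, 3.6, 6.4, 7.7, 10.2, 11.9};
        double[] candidates = {0.0, 0.01, 0.1, 1.0, 10.0};
        for (double lambda : candidates) {
            System.out.println("lambda=" + lambda + " cvError=" + cvError(x, y, 3, lambda));
        }
        System.out.println("chosen lambda=" + chooseLambda(x, y, 3, candidates));
    }
}
```

The same loop shape would apply to a recommender: replace fit and mse with training the factorization on the training folds and measuring error on the held-out ratings, then keep the lambda with the lowest held-out error.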

Re: clustering hardware requirements

2011-11-18 Thread Grant Ingersoll
On Nov 16, 2011, at 9:39 PM, Ioan Eugen Stan wrote: > Hello, > > I have to figure out how much hardware is required to do clustering > for my company on about 10+ million user accounts, each with 100-5000 > documents. The documents will be indexed so vector creation will be > done at indexing. >

Re: clustering hardware requirements

2011-11-18 Thread Ted Dunning
It is a great idea except that the centroids become harder to interpret. Not much harder. Just a bit harder. On Fri, Nov 18, 2011 at 9:44 AM, Grant Ingersoll wrote: > I haven't explored yet what it would mean to use Encoded vectors in > Clustering, but perhaps I can call Ted to the front of the

Re: Problem compiling mahout

2011-11-18 Thread Lance Norskog
I had to do the above to make it work. Why the Hadoop code can't gracefully accept a failure for this one call, I don't understand. If access fails, the code should give a 'permission denied' error. Calling chmod should just be a silent helper, not a deal-killer. On Fri, Nov 18, 2011 at 12:59 AM,

Re: Wiki edit request

2011-11-18 Thread Isabel Drost
On 12.11.2011 Lance Norskog wrote: > Also, please add a link to this- I'm not sure where. > > https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats Thanks for the pages - added the links there: https://cwiki.apache.org/confluence/display/MAHOUT/Developer+Resources Anyone mind if you

Re: New User to Mahout

2011-11-18 Thread Isabel Drost
On 12.11.2011 thinkingbigdata wrote: > I want to understand it fully and want coding to be done in Java. If anyone > can help me with some examples code that is using Hadoop written examples > that would be really helpful. Do you have any machine learning problem you want to get started with in p

Re: Coding format update: Eclipse Lucene conventions

2011-11-18 Thread Isabel Drost
On 14.11.2011 Lance Norskog wrote: > The Eclipse Lucene conventions are mighty close to what we're using, much > more so than the Eclipse formatting file on the "How To Contribute" page. > So, I've uploaded the Lucene file and changed the link. Eclipse users, > please try it and see if it's what we

Large Scale Clustering

2011-11-18 Thread Grant Ingersoll
Might be of interest: "Clustering Very Large Multi-dimensional Datasets with MapReduce" http://www.cs.cmu.edu/~jclopez/ref/kdd2011-mr-clustering.pdf Grant Ingersoll http://www.lucidimagination.com

Re: trainclassifier as a command vs. TrainClassifier.java

2011-11-18 Thread Isabel Drost
On 15.11.2011 Sam Cunningham wrote: > 2. However, when I run TrainClassifier.java program with source set to hdfs > and input and output set to locations on my local fs, it accepts the > arguments with no complaints and generates the model on my local fs > (instead of hdfs). > > In addition, the m

Re: mahout for enterprise search project

2011-11-18 Thread Isabel Drost
On 15.11.2011 Burcu Buyukkagnici wrote: > Where does mahout; Lucene/solr and UIMA framework fit in the following > scenario? For some more background on how search and machine learning fit together, see also http://www.manning.com/ingersoll/ Also at the latest ApacheConNA Grant provided some ideas an

Re: Large Scale Clustering

2011-11-18 Thread Dawid Weiss
Thanks Grant, I'll definitely check this out. D. On Fri, Nov 18, 2011 at 9:52 PM, Grant Ingersoll wrote: > Might be of interest: "Clustering Very Large Multi-dimensional Datasets > with > MapReduce" > > http://www.cs.cmu.edu/~jclopez/ref/kdd2011-mr-clustering.pdf > > > ---

Re: Documentation

2011-11-18 Thread Isabel Drost
On 16.11.2011 Ted Dunning wrote: > One thing that you can do is to point out the problems and even suggest or > provide some improvements. Your eyes are still new and thus will see > problems more clearly than ours. One thing to note: Most of the Mahout documentation is online in our wiki - that

Re: Austin Hacker Dojo - Big Data Machine Learning

2011-11-18 Thread Isabel Drost
On 17.11.2011 David Boney wrote: > If at least > three or four people are interested we can have an organization meeting to > discuss the group name, finding a location to meet, development > environment, setting up a web site, and the agenda for the first couple of > months. Just a brief comment:

Re: Large Scale Clustering

2011-11-18 Thread Isabel Drost
On 18.11.2011 Grant Ingersoll wrote: > Might be of interest: "Clustering Very Large Multi-dimensional Datasets > with MapReduce" > > http://www.cs.cmu.edu/~jclopez/ref/kdd2011-mr-clustering.pdf Judging from the abstract it looks interesting indeed. Thanks for sharing, Grant. Isabel

Re: lambda overfitting param and ParallelALSFactorizationJob -- suggested value?

2011-11-18 Thread Dmitriy Lyubimov
In my experience, 'much less important' is probably the most accurate description out of the 3. Noisy data will usually result in a cross-validation optimum with higher lambda values and vice versa (speaking from SGD experience and my specific data). In the general case, you could probably try to infer it via

Re: Wiki edit request

2011-11-18 Thread Lance Norskog
> Anyone mind if you link to that more general documentation entry page from our front page instead of to the JavaDocs? I am not quite following. Whatever makes sense. You're right, Developer Resources is the right place for detailed documentation. On Fri, Nov 18, 2011 at 12:41 PM, Isabel Drost

Re: Wiki edit request

2011-11-18 Thread Dan Beaulieu
While on the topic, the hudson url is broken... Don't know what it should be... Dan On Fri, Nov 18, 2011 at 8:51 PM, Lance Norskog wrote: > > Anyone mind if you link to that more general documentation entry page > from our > front page instead of to the JavaDocs? > > I am not quite following. W

Re: Wiki edit request

2011-11-18 Thread Lance Norskog
Fixed. A: it moved, and B: it's "Jenkins" now. On Fri, Nov 18, 2011 at 6:02 PM, Dan Beaulieu wrote: > While on the topic, the hudson url is broken... Don't know what it should > be... > > Dan > > On Fri, Nov 18, 2011 at 8:51 PM, Lance Norskog wrote: > > > > Anyone mind if you link to that more g

Error in executing mahout kmeans

2011-11-18 Thread DIPESH KUMAR SINGH
Hi, I was trying to execute the sample kmeans in Mahout on the Reuters dataset to get myself started with Mahout. After creating the sequence files, I got the following error. I am able to execute other map-reduce programs like wordcount on my Hadoop cluster. I am unable to figure out how to include these m

Mahout: NB Model for Text Classification - In Sample Error

2011-11-18 Thread Night Wolf
Hey all, Quick question regarding a potential source of in-sample bias for a text classification project. I'm developing a system which reads text messages (i.e. SMS) and tries to classify them into a number of categories. We have a few million messages. We built our training set of a spare window (~2

Re: Mahout: NB Model for Text Classification - In Sample Error

2011-11-18 Thread Ted Dunning
This test plan is pretty reasonable. There is inherently going to be some form of bias due to the time shift, but the bias is real and will affect your test results the same way it will affect your operational accuracy. It might be somewhat interesting to estimate the effect over time by also tes