Re: Connection Pooling

2011-07-12 Thread Sean Owen
You can ignore it. It just doesn't know for sure you have a pool. I believe I have even removed this in a recent refactoring. On Tue, Jul 12, 2011 at 2:21 AM, Salil Apte sa...@offlinelabs.com wrote: So I keep getting this warning from either Mahout or the server (I'm guessing the former):

Re: Plagiarism - document similarity

2011-07-12 Thread Luca Natti
Thanks to all. I need to start from the beginning of the theory; you are speaking Arabic :) to me. In other words, I need a less theoretical approach, some real code to put my hands on. Excuse this raw approach, but I need an algorithm that is really fast to implement and understand, to use in

Re: Plagiarism - document similarity

2011-07-12 Thread Em
Hi Luca, again, I have to emphasize: read what I gave you. The algorithm in my link is explained for non-scientists, and if you download Solr you will find the class and can look at how they implemented that algorithm. Anything easier would mean that someone else is writing the code for
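
For readers who, like Luca, want something concrete to run, here is a minimal sketch of one common way to score document similarity: word shingles compared by Jaccard overlap. It is not the algorithm from Em's link (the snippet does not name it), and it is not Mahout or Solr code. The function names `shingles`, `jaccard` and `similarity`, and the default shingle size of 3, are assumptions chosen for illustration.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|, 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity(doc1, doc2, size=3):
    """Score two documents between 0.0 (no shared shingles) and 1.0 (identical sets)."""
    return jaccard(shingles(doc1, size), shingles(doc2, size))
```

A score near 1.0 suggests heavily overlapping wording, which is the usual first signal in plagiarism checks. This approach compares every pair of documents directly, so it is only practical for small collections.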

What's the accuracy of random forests in Mahout?

2011-07-12 Thread Xiaobo Gu
Hi, When the training data set can be loaded into memory, or each split can be, what is the accuracy of the decision forest algorithm compared with LogisticRegression? Is anyone using random forests in production? Regards, Xiaobo Gu

File format question about Random forest.

2011-07-12 Thread Xiaobo Gu
Hi, The Random Forest partial implementation in https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation uses the ARFF file format. Is ARFF the only supported file format when using the BuildForest and TestForest programs, and are the BuildForest and TestForest programs official

Re: Using tf-idf vectors to train Naive Bayes

2011-07-12 Thread Robin Anil
Which version of naive Bayes are you using: the bayes.* package or naivebayes.*? The former uses text input; the latter uses vectors. On Tue, Jul 12, 2011 at 7:59 PM, kevin_ravel ke...@raveldata.com wrote: I'm a little confused as to the proper way to format the data for training a naive bayes
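
To make the difference between text input and vector input concrete, here is a small sketch of what a tf-idf vector is. It is plain Python, not Mahout's vectorizer and not the input format of either package. The weighting (raw term count times log(N / document frequency)) is one common variant and is an assumption, as are the name `tfidf_vectors` and the toy documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse tf-idf vectors (term -> weight dicts).

    Weight = term count in the document * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


# Hypothetical two-document corpus, already tokenized on whitespace.
docs = [
    "cheap pills buy now".split(),
    "meeting agenda for now".split(),
]
vectors = tfidf_vectors(docs)
```

In this toy corpus, a term that occurs in every document (here "now") gets weight 0, while terms that occur in only one document get the largest weight.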

Re: What's the accuracy of random forests in Mahout?

2011-07-12 Thread Ted Dunning
I don't believe that Mahout's random forests have been used in production. I have heard that some people got pretty good results in testing. On Tue, Jul 12, 2011 at 6:03 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote: Hi, When the training data set can be loaded into memory, or each split can

ItemSimilarity pre-processing

2011-07-12 Thread Abmar Barros
Hi all, I am new to Mahout and I am building a Recommender for buddycloud (http://buddycloud.com/) as part of my GSoC project (https://github.com/buddycloud/channel-directory). In the testing snapshot, I have ~100k users, ~20k items and ~230k boolean taste preferences. At first I tried an

Random Forest feature types

2011-07-12 Thread Don Pazel
From what I can see, the random forest implementation takes either numerical or categorical feature data. That worked fine for me, until I tried to incorporate word or text features. I liked the encoders used in SGD, but they don't seem to apply to random forests. So, did I overlook

Re: combination of features worsen the performance

2011-07-12 Thread Weihua Zhu
Hi Ted, Thanks very much for your very detailed reply. It is very helpful. I still have some questions; I hope I am not polluting this mailing list too much. I understand all your comments except the one below: Finally, you should be combining group ranking objective as well as regression objectives.

Re: combination of features worsen the performance

2011-07-12 Thread Weihua Zhu
Thanks. We are trying to get a larger dataset, probably over 2000 for each class. What do you mean by the errors on performance estimates? The confusion matrix? On Jul 11, 2011, at 2:44 PM, Konstantin Shmakov wrote: It seems that training data set is way too small. What are the errors on

Re: Connection Pooling

2011-07-12 Thread Salil Apte
Oh yea, at runtime, I'm getting back a BasicDataSource object for my DataSource. Is that correct? On Tue, Jul 12, 2011 at 9:59 PM, Salil Apte sa...@offlinelabs.com wrote: So I started actually looking at performance today and it is pretty horrendous. I've got about 61,000 rows in my database