Re: Connection Pooling
You can ignore it. It just doesn't know for sure you have a pool. I believe I have even removed this in a recent refactoring.

On Tue, Jul 12, 2011 at 2:21 AM, Salil Apte sa...@offlinelabs.com wrote:

So I keep getting this warning from either Mahout or the server (I'm guessing the former):

WARNING: You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections to the database itself, or database performance will be severely reduced.

I'm not really sure why this is happening. I have the following resource in my webapp's context.xml file. Is there anything else I need to do to enable connection pooling with a JNDI resource?

<Resource name="jdbc/offline-local" auth="Container" type="javax.sql.DataSource"
          username="root" password="" driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost:3306/offlinedevel?autoReconnect=true&amp;cachePreparedStatements=true&amp;cachePrepStmts=true&amp;cacheResultSetMetadata=true&amp;alwaysSendSetIsolation=false&amp;elideSetAutoCommits=true"
          validationQuery="select 1" maxActive="16" maxIdle="4"
          removeAbandoned="true" logAbandoned="true" />

Thanks in advance. -Salil
Re: Plagiarism - document similarity
Thanks to all. I need to start from the beginning theory; you are speaking Arabic :) to me. In other words, I need a less theoretical approach, some real code to put my hands on. Excuse this raw approach, but I need an algorithm that is fast to implement and understand, to use in a real-world scenario, possibly now ;). Alternatively, I need a basic text(book) to start reading so I can get to the point of understanding what you are saying. Thanks again.

On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning ted.dunn...@gmail.com wrote:

Easier to simply index all, say, three-word phrases and use a TF-IDF score. This will give you a good proxy for sequence similarity. Documents should either be chopped on paragraph boundaries to have a roughly constant length, or the score should not be normalized by document length. A log-likelihood ratio (LLR) test can be useful to extract good query features from the subject document. The TF-IDF score is a reasonable proxy for this, although it does lead to some problems. The reason TF-IDF works as a query term selection method, and why it fails, can be seen from the fact that TF-IDF is very close to one of the most important terms in the LLR score.

On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg andrew.clegg+mah...@gmail.com wrote:

On 11 July 2011 08:19, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

- bioinformatics, in particular gene sequencing to detect long near-matching sequences (a variation of the above; I'm not familiar with any particular algorithms, but I imagine this is a well-explored space)

The classic is Smith-Waterman: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

This approach has been used in general text-processing tasks too, e.g.: http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf

given the funds they receive ;),

Hah! Less so these days I'm afraid :-)

Andrew.
--
http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
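Ted's suggestion in the thread (index overlapping three-word phrases and score with TF-IDF) can be sketched in a few lines of plain Java. This is an illustrative toy, not Mahout or Solr code; the class and method names are made up for the example:

```java
import java.util.*;

// Sketch: overlapping 3-word "shingles" scored by TF-IDF-weighted cosine
// similarity, as a cheap proxy for sequence similarity between texts.
public class ShingleSimilarity {

    // Break a document into overlapping 3-word phrases.
    static List<String> shingles(String text) {
        String[] w = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= w.length; i++) {
            out.add(w[i] + " " + w[i + 1] + " " + w[i + 2]);
        }
        return out;
    }

    // TF-IDF weighted sparse vector for one document against a small corpus.
    static Map<String, Double> tfidf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> v = new HashMap<>();
        for (String s : doc) v.merge(s, 1.0, Double::sum);
        for (Map.Entry<String, Double> e : v.entrySet()) {
            int df = 0;
            for (List<String> d : corpus) if (d.contains(e.getKey())) df++;
            // Smoothed IDF so shingles shared by every document keep a
            // nonzero weight in a tiny corpus.
            e.setValue(e.getValue() * Math.log(1.0 + (double) corpus.size() / df));
        }
        return v;
    }

    // Cosine similarity of two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double x : b.values()) nb += x * x;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }
}
```

Ted's point about chopping documents on paragraph boundaries maps naturally onto this sketch: call `shingles` per paragraph rather than per whole document, so scores stay comparable across lengths.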
Re: Plagiarism - document similarity
Hi Luca,

Again, I have to emphasize: read what I gave you. The algorithm in my link was explained for non-scientists, and if you are going to download Solr you will find the class, so you can have a look at how they implemented that algorithm. Anything easier would mean that someone else is writing the code for you ;).

Regards, Em

On 12.07.2011 09:58, Luca Natti wrote:

Thanks to all. I need to start from the beginning theory; you are speaking Arabic :) to me. In other words, I need a less theoretical approach, some real code to put my hands on. Excuse this raw approach, but I need an algorithm that is fast to implement and understand, to use in a real-world scenario, possibly now ;). Alternatively, I need a basic text(book) to start reading so I can get to the point of understanding what you are saying. Thanks again.

On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning ted.dunn...@gmail.com wrote:

Easier to simply index all, say, three-word phrases and use a TF-IDF score. This will give you a good proxy for sequence similarity. Documents should either be chopped on paragraph boundaries to have a roughly constant length, or the score should not be normalized by document length. A log-likelihood ratio (LLR) test can be useful to extract good query features from the subject document. The TF-IDF score is a reasonable proxy for this, although it does lead to some problems. The reason TF-IDF works as a query term selection method, and why it fails, can be seen from the fact that TF-IDF is very close to one of the most important terms in the LLR score.

On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg andrew.clegg+mah...@gmail.com wrote:

On 11 July 2011 08:19, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

- bioinformatics, in particular gene sequencing to detect long near-matching sequences (a variation of the above; I'm not familiar with any particular algorithms, but I imagine this is a well-explored space)

The classic is Smith-Waterman: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

This approach has been used in general text-processing tasks too, e.g.: http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf

given the funds they receive ;),

Hah! Less so these days I'm afraid :-)

Andrew.
--
http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
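The log-likelihood ratio test Ted mentions has a compact form for a 2x2 contingency table. Mahout ships an implementation in org.apache.mahout.math.stats.LogLikelihood; the self-contained version below shows roughly the same computation (the entropy-based formulation of the G-squared statistic) for illustration:

```java
// Sketch of the 2x2 log-likelihood ratio (G^2) score useful for picking
// good query features: k11 = feature in this doc, k12 = feature in the
// rest of the corpus, k21 = other features in this doc, k22 = other
// features in the rest of the corpus.
public class Llr {
    static double xLogX(long x) { return x == 0 ? 0.0 : x * Math.log(x); }

    // Unnormalized entropy: xLogX(total) minus the sum of xLogX(count).
    static double entropy(long... counts) {
        long sum = 0;
        double s = 0;
        for (long c : counts) { s += xLogX(c); sum += c; }
        return xLogX(sum) - s;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        return 2.0 * (entropy(k11 + k12, k21 + k22)   // row entropy
                    + entropy(k11 + k21, k12 + k22)   // column entropy
                    - entropy(k11, k12, k21, k22));   // matrix entropy
    }
}
```

Independent counts score near zero; strongly associated counts score high, which is what makes the statistic useful for query term selection.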
What's the accuracy of random forests in Mahout?
Hi,

When the training data set can be loaded into memory, or each split of it can be, what is the accuracy of the decision forest algorithm compared with logistic regression? Do you have production usages of random forests?

Regards, Xiaobo Gu
File format question about Random forest.
Hi,

The random forest partial implementation at https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation uses the ARFF file format. Is ARFF the only supported file format when using the BuildForest and TestForest programs, and are BuildForest and TestForest the official tools for building random forest models from the command line?

Regards, Xiaobo Gu
Re: Using tf-idf vectors to train Naive Bayes
Which version of naive bayes are you using, the bayes.* package or naivebayes.*? The former uses text input; the latter uses vectors.

On Tue, Jul 12, 2011 at 7:59 PM, kevin_ravel ke...@raveldata.com wrote:

I'm a little confused as to the proper way to format the data for training a naive Bayes classifier. Is it possible to give the classifier tfidf-vectors generated using the results from seq2sparse? I have arranged it so that I have a sequence file where the key is the target variable and the value is a tfidf vector. When I use this as the input to trainclassifier I get the following error:

Running on hadoop, using HADOOP_HOME=/home/kevin/Hadoop/hadoop-0.20.2/
No HADOOP_CONF_DIR set, using /home/kevin/Hadoop/hadoop-0.20.2//src/conf
11/07/12 09:27:13 WARN driver.MahoutDriver: No trainclassifier.props found on classpath, will use command-line arguments only
11/07/12 09:27:13 INFO bayes.TrainClassifier: Training Bayes Classifier
11/07/12 09:27:13 INFO bayes.BayesDriver: Reading features...
11/07/12 09:27:13 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/07/12 09:27:14 INFO mapred.FileInputFormat: Total input paths to process : 1
11/07/12 09:27:14 INFO mapred.JobClient: Running job: job_201107120921_0001
11/07/12 09:27:15 INFO mapred.JobClient: map 0% reduce 0%
11/07/12 09:27:24 INFO mapred.JobClient: Task Id : attempt_201107120921_0001_m_00_0, Status : FAILED
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 5 more

Thanks
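For the older bayes.* trainer, the expected input is plain text rather than vectors. As far as I recall (worth verifying against the PrepareTwentyNewsgroups example that ships with Mahout), each input line carries the category label, a tab, and then the already-tokenized document terms, along these lines (labels and terms here are made up):

```
sports	match team goal score season
politics	election vote senate bill
```

If you want to train on seq2sparse tfidf-vectors instead, that is the naivebayes.* path, not trainclassifier.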
Re: What's the accuracy of random forests in Mahout?
I don't believe that Mahout's random forests have been used in production. I have heard that some people got pretty good results in testing.

On Tue, Jul 12, 2011 at 6:03 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote:

Hi, When the training data set can be loaded into memory, or each split of it can be, what is the accuracy of the decision forest algorithm compared with logistic regression? Do you have production usages of random forests? Regards, Xiaobo Gu
ItemSimilarity pre-processing
Hi all,

I am new to Mahout and I am putting up a recommender for buddycloud (http://buddycloud.com/) as a part of my GSoC project (https://github.com/buddycloud/channel-directory). In the testing snapshot, I got ~100k users, ~20k items and ~230k boolean taste preferences.

At first I tried a UserBasedRecommender with an all-in-memory DataModel (read from a dump file, created a GenericDataModel). The recommendations performed great, almost real time. However, I thought this strategy wouldn't scale, since the numbers of users and items tend to increase, and then the service could run out of memory. Then I tried a PostgreSQLBooleanPrefJDBCDataModel, and, as expected, the performance dropped drastically.

After reading the blog post at http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/, I decided to try an ItemBasedRecommender using a preprocessed ItemSimilarity table. I am trying not to use MapReduce at first, so I tried to compute the LogLikelihood similarity for every pair of items. This took too long, and I gave up.

Finally, my questions are: Am I doing things right? What is the best way to compute item similarity offline without MapReduce?

Thanks in advance!
Abmar
--
Abmar Barros
MSc candidate in Computer Science at Federal University of Campina Grande - www.ufcg.edu.br
OurGrid Team Member - www.ourgrid.org
Paraíba - Brazil
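One way to avoid the all-pairs blow-up without MapReduce is to score only item pairs that actually co-occur in some user's history; with ~230k boolean preferences the number of co-occurring pairs is usually far smaller than 20k x 20k. A rough single-machine sketch (class and method names are made up for the example, not Mahout API):

```java
import java.util.*;

// Offline item-item log-likelihood similarity: enumerate co-occurring
// pairs via each user's item set, never touching pairs with zero overlap.
public class OfflineItemSimilarity {

    static double xLogX(long x) { return x == 0 ? 0.0 : x * Math.log(x); }
    static double entropy(long... k) {
        long sum = 0;
        double s = 0;
        for (long c : k) { s += xLogX(c); sum += c; }
        return xLogX(sum) - s;
    }
    static double llr(long k11, long k12, long k21, long k22) {
        return 2.0 * (entropy(k11 + k12, k21 + k22)
                    + entropy(k11 + k21, k12 + k22)
                    - entropy(k11, k12, k21, k22));
    }

    // prefs: userId -> set of itemIds the user has a boolean preference for.
    // Returns "itemA:itemB" (A < B) -> LLR similarity, co-occurring pairs only.
    static Map<String, Double> similarities(Map<Long, Set<Long>> prefs) {
        Map<Long, Integer> itemCount = new HashMap<>();
        Map<String, Integer> pairCount = new HashMap<>();
        long users = prefs.size();
        for (Set<Long> items : prefs.values()) {
            for (long i : items) itemCount.merge(i, 1, Integer::sum);
            Long[] arr = items.toArray(new Long[0]);
            Arrays.sort(arr);
            for (int a = 0; a < arr.length; a++)
                for (int b = a + 1; b < arr.length; b++)
                    pairCount.merge(arr[a] + ":" + arr[b], 1, Integer::sum);
        }
        Map<String, Double> sim = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairCount.entrySet()) {
            String[] p = e.getKey().split(":");
            long cA = itemCount.get(Long.parseLong(p[0]));
            long cB = itemCount.get(Long.parseLong(p[1]));
            long both = e.getValue();
            // 2x2 contingency: users with both, only A, only B, neither.
            sim.put(e.getKey(), llr(both, cA - both, cB - both,
                                    users - cA - cB + both));
        }
        return sim;
    }
}
```

The resulting map can then be bulk-loaded into the precomputed similarity table for the ItemBasedRecommender.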
Random Forest feature types
From what I can see, the random forest implementation takes either numerical or categorical feature data. That worked fine for me, until I tried to incorporate word or text features. I liked the encoders used in SGD, but they don't seem to apply to random forests. So, did I overlook something simple that would allow me to include word or text features? If not, are there plans (assuming the core algorithm allows) to add these feature types to random forests in the future? Thanks, Don Pazel
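The SGD encoders mentioned above are based on hashed feature encoding, and the underlying idea is easy to sketch independently of Mahout. The class below is illustrative only: it is not part of the random forest API, and Mahout's FeatureVectorEncoder classes do something more refined (multiple probes, weighting), but it shows how word features could in principle be turned into the fixed-length numerical vectors the forest code accepts:

```java
// Hashing-trick sketch: each word is hashed into one of a fixed number of
// buckets, producing a plain numerical feature vector of fixed length.
public class HashedWordEncoder {
    static double[] encode(String text, int dim) {
        double[] v = new double[dim];
        for (String w : text.toLowerCase().split("\\s+")) {
            int bucket = Math.floorMod(w.hashCode(), dim); // stable bucket index
            v[bucket] += 1.0;                              // collisions add together
        }
        return v;
    }
}
```

Whether hashed buckets make good split candidates for trees is a separate question; this only addresses getting text into a numerical representation at all.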
Re: combination of features worsen the performance
Hi Ted,

Thanks very much for your very detailed reply. It is very helpful. Still some questions; I hope I am not polluting this email list much. I understand all your comments except the part below:

"Finally, you should be combining a group ranking objective as well as regression objectives. Otherwise, your model will simply be learning which users are likely to click on anything and which users will never click on anything. There are provisions for segmented AUC in the code, but that will only work for binary targets. In general, it is common to build cascaded models to deal with this. The first model learns to predict click and the cascaded model learns conversion conditional on click."

We can use binary targets; that shouldn't be a problem. Could you say a little more about segmented AUC, and also about the cascaded models? Do you have any reference papers/books/code samples/example projects for recommendation? I have the Mahout in Action book, but I didn't see stuff like that there.

Thanks again for your help.
-Weihua

On Jul 11, 2011, at 3:30 PM, Ted Dunning wrote:

There are lots of problems with the problem as posed. I am not surprised by poor results. You should not downsample negative examples so severely. I would keep as many as 10-30x as many negative examples as you have positive examples. Even then, I suspect you don't have enough data, especially if you have already included data for all of your models.

Your Feature A is not useful unless you are putting all ad results together. Even then, you need to include more advertiser, campaign and ad specific features. The feature vector size of 10,000 is actually relatively small if you have any reasonable degree of sparsity in your user and ad features. Unused features do not hurt learning.

Finally, you should be combining a group ranking objective as well as regression objectives. Otherwise, your model will simply be learning which users are likely to click on anything and which users will never click on anything. There are provisions for segmented AUC in the code, but that will only work for binary targets. In general, it is common to build cascaded models to deal with this. The first model learns to predict click and the cascaded model learns conversion conditional on click.

Most importantly, really, I would recommend that you experiment with model design using a system like R so that you can get fast turn-around on modeling efforts.

On Mon, Jul 11, 2011 at 3:04 PM, Weihua Zhu w...@adconion.com wrote:

Hi, Thanks Ted. I understand that the training dataset size is small. The reason is that we have a very limited number of action-class events/instances. We also want each target class to have an equal number of events/instances. Feature A is the advertisement campaign ID, and Feature B is the behaviors that the internet user has, for example, gender: male, country: us, etc. I set the size of the encoder to 1, which is very large. I used this setup for OnlineLogisticRegression:

olr = new OnlineLogisticRegression(3, FEATURES, new L1());
olr.alpha(1).stepOffset(1000).lambda(3e-5).learningRate(3);

Thanks. -wz

On Jul 11, 2011, at 2:49 PM, Ted Dunning wrote:

This is a tiny amount of data. The regularization in Mahout's SGD implementation is probably not as effective as second-order techniques for such tiny data. By the way, you didn't answer my questions about what kind of data features A and B are. I understand that you might be shy about this, but without that kind of information, I can't help you. (And add this additional question:) What is the size of the encoded vector?

On Mon, Jul 11, 2011 at 2:26 PM, Weihua Zhu w...@adconion.com wrote:

The target class is whether a user clicks an ad (advertisement), buys through an ad, or neither; so 3 classes. Feature A is about the advertisement itself; Feature B is about the user's behaviors. Currently I'm only using features A and B. Total training data is 250 for each class. Thanks.

From: Ted Dunning [ted.dunn...@gmail.com]
Sent: Monday, July 11, 2011 2:15 PM
To: user@mahout.apache.org
Subject: Re: combination of features worsen the performance

Can you say a little bit about the data? What are features A and B? What kind of data do they represent? How many other features are there? What is the target variable? How many possible values does it have? How much training data do you have? What sort of training are you doing?

On Mon, Jul 11, 2011 at 2:08 PM, Weihua Zhu w...@adconion.com wrote:

Hi, Dear all, I am using Mahout logistic regression for classification. Interestingly, for features A and B individually, each has satisfactory performance, say 65% and 80%, but when I combine them together (using an encoder), the performance is about 72%. Shouldn't the performance be better? Any thoughts? Thanks a lot, -wz.
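The cascade Ted describes (one model for the probability of a click, a second model trained only on clicked impressions for the probability of conversion given a click) combines multiplicatively at prediction time. A toy sketch, where the two weight vectors stand in for trained models (all values here are placeholders, not trained parameters):

```java
// Two-stage cascade sketch: stage 1 scores p(click | x), stage 2
// (trained only on clicked impressions) scores p(conversion | click, x);
// the product is the expected conversion probability for impression x.
public class CascadeSketch {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    static double score(double[] w, double[] x) {
        double z = 0;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return sigmoid(z);
    }

    static double expectedConversion(double[] wClick, double[] wConv, double[] x) {
        return score(wClick, x) * score(wConv, x);
    }
}
```

The practical point is that each stage is a binary-target model, so tools like segmented AUC that only work for binary targets apply to each stage separately.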
Re: combination of features worsen the performance
Thanks. We are trying to get a larger dataset, probably over 2000 for each class. What do you mean by the errors on performance estimates? The confusion matrix?

On Jul 11, 2011, at 2:44 PM, Konstantin Shmakov wrote:

It seems that the training data set is way too small. What are the errors on the performance estimates?

On Mon, Jul 11, 2011 at 2:26 PM, Weihua Zhu w...@adconion.com wrote:

The target class is whether a user clicks an ad (advertisement), buys through an ad, or neither; so 3 classes. Feature A is about the advertisement itself; Feature B is about the user's behaviors. Currently I'm only using features A and B. Total training data is 250 for each class. Thanks.

From: Ted Dunning [ted.dunn...@gmail.com]
Sent: Monday, July 11, 2011 2:15 PM
To: user@mahout.apache.org
Subject: Re: combination of features worsen the performance

Can you say a little bit about the data? What are features A and B? What kind of data do they represent? How many other features are there? What is the target variable? How many possible values does it have? How much training data do you have? What sort of training are you doing?

On Mon, Jul 11, 2011 at 2:08 PM, Weihua Zhu w...@adconion.com wrote:

Hi, Dear all, I am using Mahout logistic regression for classification. Interestingly, for features A and B individually, each has satisfactory performance, say 65% and 80%, but when I combine them together (using an encoder), the performance is about 72%. Shouldn't the performance be better? Any thoughts? Thanks a lot, -wz.
--
ksh:
Re: Connection Pooling
Oh yeah, at runtime, I'm getting back a BasicDataSource object for my DataSource. Is that correct?

On Tue, Jul 12, 2011 at 9:59 PM, Salil Apte sa...@offlinelabs.com wrote:

So I started actually looking at performance today and it is pretty horrendous. I've got about 61,000 rows in my database, which I'm assuming isn't *that* many rows. But recommendations are taking 20 seconds. Is there some way to ensure pooling is turned on? What else is a big driver for performance? My tables are set up with a multi-column unique index on the user_id, item_id pair, so there cannot be two entries with the same user_id and item_id. I'm not sure where to go from here. Thanks for the help!

On Tue, Jul 12, 2011 at 12:47 AM, Sean Owen sro...@gmail.com wrote:

You can ignore it. It just doesn't know for sure you have a pool. I believe I have even removed this in a recent refactoring.

On Tue, Jul 12, 2011 at 2:21 AM, Salil Apte sa...@offlinelabs.com wrote:

So I keep getting this warning from either Mahout or the server (I'm guessing the former):

WARNING: You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections to the database itself, or database performance will be severely reduced.

I'm not really sure why this is happening. I have the following resource in my webapp's context.xml file. Is there anything else I need to do to enable connection pooling with a JNDI resource?

<Resource name="jdbc/offline-local" auth="Container" type="javax.sql.DataSource"
          username="root" password="" driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost:3306/offlinedevel?autoReconnect=true&amp;cachePreparedStatements=true&amp;cachePrepStmts=true&amp;cacheResultSetMetadata=true&amp;alwaysSendSetIsolation=false&amp;elideSetAutoCommits=true"
          validationQuery="select 1" maxActive="16" maxIdle="4"
          removeAbandoned="true" logAbandoned="true" />

Thanks in advance. -Salil
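For the record, Tomcat's default javax.sql.DataSource Resource is backed by Commons DBCP, and BasicDataSource does pool connections, so seeing a BasicDataSource at runtime is expected. If you want to be explicit about pool behavior, DBCP exposes a few more attributes; a sketch with illustrative values (the extra attributes are standard DBCP, but tune them for your load):

```xml
<Resource name="jdbc/offline-local" auth="Container" type="javax.sql.DataSource"
          username="root" password="" driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost:3306/offlinedevel"
          maxActive="16" maxIdle="4" maxWait="10000"
          validationQuery="select 1" testOnBorrow="true"
          removeAbandoned="true" logAbandoned="true" />
```

If pooling is confirmed and recommendations still take ~20 seconds, the likely cause is the many small queries the JDBC data models issue per recommendation; Mahout's ReloadFromJDBCDataModel, which keeps the data in memory and refreshes from the database periodically, may be worth a look.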