Re: Connection Pooling

2011-07-12 Thread Sean Owen
You can ignore it. It just doesn't know for sure you have a pool.
I believe I have even removed this in a recent refactoring.

On Tue, Jul 12, 2011 at 2:21 AM, Salil Apte sa...@offlinelabs.com wrote:

 So I keep getting this warning from either Mahout or the server (I'm
 guessing the former):

 WARNING: You are not using ConnectionPoolDataSource. Make sure your
 DataSource pools connections to the database itself, or database
 performance will be severely reduced.

 I'm not really sure why this is happening. I have the following
 resource in my webapp's context.xml file. Is there anything else I
 need to do to enable connection pooling with a JNDI resource?

 <Resource name="jdbc/offline-local" auth="Container"
 type="javax.sql.DataSource" username="root" password=""
 driverClassName="com.mysql.jdbc.Driver"
 url="jdbc:mysql://localhost:3306/offlinedevel?autoReconnect=true&amp;cachePreparedStatements=true&amp;cachePrepStmts=true&amp;cacheResultSetMetadata=true&amp;alwaysSendSetIsolation=false&amp;elideSetAutoCommits=true"
 validationQuery="select 1" maxActive="16" maxIdle="4"
 removeAbandoned="true" logAbandoned="true" />

 Thanks in advance.

 -Salil
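As an aside: this warning keys off the runtime class of the DataSource, so it can appear even when the container is pooling correctly, which is why it is safe to ignore. If you want to request Tomcat's DBCP pool explicitly, a factory attribute can be named; this is a sketch, with the factory class as assumed for Tomcat 6-era repackaged DBCP, so verify against your container's documentation:

```xml
<Resource name="jdbc/offline-local" auth="Container"
          type="javax.sql.DataSource"
          factory="org.apache.tomcat.dbcp.dbcp.BasicDataSourceFactory"
          driverClassName="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost:3306/offlinedevel"
          username="root" password=""
          maxActive="16" maxIdle="4"
          validationQuery="select 1"/>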



Re: Plagiarism - document similarity

2011-07-12 Thread Luca Natti
Thanks to all,

I need to start from the beginning of the theory.
You are speaking Arabic :) to me. In other words, I need
a less theoretical approach, some real code to put my hands on.
Excuse this blunt approach, but I need an algorithm that is fast to
implement and understand, to use in a real-world scenario, possibly now ;).
Alternatively, I need an introductory text(book) to start reading so that I
can get to the point of understanding what you are saying.

thanks again

On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Easier to simply index all, say, three-word phrases and use a TF-IDF score.
  This will give you a good proxy for sequence similarity.  Documents should
 either be chopped on paragraph boundaries to have a roughly constant length
 or the score should not be normalized by document length.
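Ted's suggestion of indexing three-word phrases can be sketched outside Mahout in a few lines. This is an illustrative toy, not the Lucene/Mahout pipeline he has in mind, and all function names here are invented for the example:

```python
import math
from collections import Counter

def shingles(text, n=3):
    """Break a document into overlapping n-word phrases."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tfidf_vectors(docs, n=3):
    """Weight each document's n-word shingles by TF-IDF."""
    shingled = [Counter(shingles(d, n)) for d in docs]
    df = Counter()                     # document frequency per shingle
    for counts in shingled:
        df.update(counts.keys())
    total = len(docs)
    return [{s: tf * math.log(total / df[s]) for s, tf in counts.items()}
            for counts in shingled]

def cosine(a, b):
    """Cosine similarity between two sparse {shingle: weight} vectors."""
    dot = sum(w * b.get(s, 0.0) for s, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Because the shingles are phrases rather than single words, a high cosine score is a proxy for shared word *sequences*, which is the point for plagiarism detection.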

 Log likelihood ratio (LLR) test can be useful to extract good query
 features
 from the subject document.  TF-IDF score is a reasonable proxy for this
 although it does lead to some problems.  Why TF-IDF works as a query term
 selection method, and why it sometimes fails, can be seen from the fact
 that TF-IDF is very close to one of the most important terms in the LLR
 score.
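The LLR test Ted refers to is his G² statistic over a 2×2 contingency table; one common framing counts how often a phrase occurs in the subject document versus the rest of the corpus. A minimal sketch of the score itself, following the entropy formulation used in Mahout's LogLikelihood class:

```python
import math

def _entropy(*counts):
    """Unnormalized entropy: x*log(x) of the total minus each cell's."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.
    k11: both events together; k12/k21: one without the other;
    k22: neither event."""
    row = _entropy(k11 + k12, k21 + k22)   # row marginals
    col = _entropy(k11 + k21, k12 + k22)   # column marginals
    mat = _entropy(k11, k12, k21, k22)     # full table
    return max(0.0, 2.0 * (row + col - mat))
```

Phrases with high LLR against the background corpus make good query features for finding near-duplicate documents.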

 On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg 
 andrew.clegg+mah...@gmail.com
  wrote:

  On 11 July 2011 08:19, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
   - bioinformatics, in particular gene sequencing to detect long
   near-matching sequences (a variation of the above, I'm not familiar
   with any particular algorithms, but I imagine this is a well explored
   space
 
  The classic is Smith & Waterman:
 
  http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
 
  This approach has been used in general text processing tasks too, e.g.:
 
 
 
 http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
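For reference, Smith-Waterman is a small dynamic program. A toy sketch that works on strings or word lists; the scoring parameters here are arbitrary illustrative choices:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Score of the best local alignment between sequences a and b.
    H[i][j] holds the best alignment ending at a[i-1], b[j-1];
    the floor of 0 is what makes the alignment local."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score,  # match/substitution
                          H[i - 1][j] + gap,        # gap in b
                          H[i][j - 1] + gap)        # gap in a
            best = max(best, H[i][j])
    return best
```

The same recurrence runs over word sequences for plagiarism-style matching, e.g. `smith_waterman(doc1.split(), doc2.split())`, though the full O(nm) table is too slow for whole corpora without a candidate-selection step first.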
 
   given the funds they receive ;),
 
  Hah! Less so these days I'm afraid :-)
 
  Andrew.
 
  --
 
  http://tinyurl.com/andrew-clegg-linkedin |
 http://twitter.com/andrew_clegg
 



Re: Plagiarism - document similarity

2011-07-12 Thread Em
Hi Luca,

again, I have to emphasize: read what I gave you.
The algorithm in my link is explained for non-scientists, and if you
download Solr you will find the class, so you can look at how they
implemented that algorithm.

"Easier" would mean that someone else writes the code for you ;).

Regards,
Em

Am 12.07.2011 09:58, schrieb Luca Natti:
 Thanks to all,
 
 I need to start from the beginning of the theory.
 You are speaking Arabic :) to me. In other words, I need
 a less theoretical approach, some real code to put my hands on.
 Excuse this blunt approach, but I need an algorithm that is fast to
 implement and understand, to use in a real-world scenario, possibly now ;).
 Alternatively, I need an introductory text(book) to start reading so that I
 can get to the point of understanding what you are saying.
 
 thanks again
 
 On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Easier to simply index all, say, three-word phrases and use a TF-IDF score.
  This will give you a good proxy for sequence similarity.  Documents should
 either be chopped on paragraph boundaries to have a roughly constant length
 or the score should not be normalized by document length.

 Log likelihood ratio (LLR) test can be useful to extract good query
 features
 from the subject document.  TF-IDF score is a reasonable proxy for this
 although it does lead to some problems.  Why TF-IDF works as a query term
 selection method, and why it sometimes fails, can be seen from the fact
 that TF-IDF is very close to one of the most important terms in the LLR
 score.

 On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg 
 andrew.clegg+mah...@gmail.com
 wrote:

 On 11 July 2011 08:19, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
 - bioinformatics, in particular gene sequencing to detect long
 near-matching sequences (a variation of the above, I'm not familiar
 with any particular algorithms, but I imagine this is a well explored
 space

 The classic is Smith & Waterman:

 http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

 This approach has been used in general text processing tasks too, e.g.:



 http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf

 given the funds they receive ;),

 Hah! Less so these days I'm afraid :-)

 Andrew.

 --

 http://tinyurl.com/andrew-clegg-linkedin |
 http://twitter.com/andrew_clegg


 


What's the accuracy of random forests in Mahout?

2011-07-12 Thread Xiaobo Gu
Hi,

When the training data set can be loaded into memory, or each split
can be, what is the accuracy of the decision forest algorithm compared
with logistic regression? Do you have production usages of random
forests?

Regards,

Xiaobo Gu


File format question about Random forest.

2011-07-12 Thread Xiaobo Gu
Hi,

The Random Forest partial implementation described at
https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
uses the ARFF file format. Is ARFF the only supported file format when
using the BuildForest and TestForest programs, and are BuildForest and
TestForest the official tools for building Random Forest models from
the command line?

Regards,

Xiaobo Gu
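For what it's worth, the wiki page above drives the partial implementation roughly like this. This is a sketch from memory of the 0.5-era tools; the data-descriptor string is an assumption (N numerical, C categorical, L label), and class names and flags moved between releases, so trust `--help` on your version over this:

```
# describe the input data first, then build the forest
hadoop jar mahout-examples-job.jar org.apache.mahout.df.tools.Describe \
  -p data/train.data -f data/train.info -d N 3 C 2 N L
hadoop jar mahout-examples-job.jar org.apache.mahout.df.mapreduce.BuildForest \
  -d data/train.data -ds data/train.info -sl 5 -p -t 100 -o forest
```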


Re: Using tf-idf vectors to train Naive Bayes

2011-07-12 Thread Robin Anil
Which version of naivebayes are you using?

bayes.* package or naivebayes.* ?

The former uses text input; the latter uses vectors.
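Concretely, "text input" for the bayes.* trainer means preprocessed text of the shape produced by the 20-newsgroups preparation job: one document per line, label first, then the tokens. This example is from memory of that era's format, so verify against the PrepareTwentyNewsgroups output on your version:

```
sci.space	nasa launched the shuttle this morning with a crew of six
rec.sport	the home side scored twice in extra time to win the cup
```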

On Tue, Jul 12, 2011 at 7:59 PM, kevin_ravel ke...@raveldata.com wrote:

 I'm a little confused as to the proper way to format the data for
 training a naive bayes classifier. Is it possible to give the classifier
 tfidf-vectors generated using the results from seq2sparse? I have
 arranged it so that I have a sequence file where the key is the target
 variable and the value is a tfidf vector. When I use this as the input
 to trainclassifier I get the following error:

 Running on hadoop, using HADOOP_HOME=/home/kevin/Hadoop/hadoop-0.20.2/
 No HADOOP_CONF_DIR set, using /home/kevin/Hadoop/hadoop-0.20.2//src/conf
 11/07/12 09:27:13 WARN driver.MahoutDriver: No trainclassifier.props found
 on classpath, will use command-line arguments only
 11/07/12 09:27:13 INFO bayes.TrainClassifier: Training Bayes Classifier
 11/07/12 09:27:13 INFO bayes.BayesDriver: Reading features...
 11/07/12 09:27:13 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 11/07/12 09:27:14 INFO mapred.FileInputFormat: Total input paths to process
 : 1
 11/07/12 09:27:14 INFO mapred.JobClient: Running job: job_201107120921_0001
 11/07/12 09:27:15 INFO mapred.JobClient:  map 0% reduce 0%
 11/07/12 09:27:24 INFO mapred.JobClient: Task Id :
 attempt_201107120921_0001_m_00_0, Status : FAILED
 java.lang.RuntimeException: Error in configuring object
at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at

 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 5 more


 Thanks

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Using-tf-idf-vectors-to-train-Naive-Bayes-tp3162590p3162590.html
 Sent from the Mahout User List mailing list archive at Nabble.com.



Re: What's the accuracy of random forests in Mahout?

2011-07-12 Thread Ted Dunning
I don't believe that Mahout's random forests have been used in production.
 I have heard that some people got pretty good results in testing.

On Tue, Jul 12, 2011 at 6:03 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote:

 Hi,

 When the training data set can be loaded into memory, or each split
 can be, what is the accuracy of the decision forest algorithm compared
 with logistic regression? Do you have production usages of random
 forests?

 Regards,

 Xiaobo Gu



ItemSimilarity pre-processing

2011-07-12 Thread Abmar Barros
Hi all,

I am new to Mahout and I am putting up a Recommender for buddycloud (
http://buddycloud.com/) as a part of my GSoC project (
https://github.com/buddycloud/channel-directory).
In the testing snapshot, I got ~100k users, ~20k items and ~230k boolean
taste preferences.
At first I tried a UserBasedRecommender with an all-in-memory DataModel
(read from a dump file into a GenericDataModel). The recommendations
performed great, almost in real time. However, I thought this strategy
wouldn't scale: as the number of users and items grows, the service
could run out of memory.

Then I tried a PostgreSQLBooleanPrefJDBCDataModel, and, as expected, the
performance dropped drastically. After reading the blog post at
http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/,
I decided to try an ItemBasedRecommender using a preprocessed
ItemSimilarity table. I am trying not to use MapReduce at first, so I
tried to compute the log-likelihood similarity for every pair of items.
This took too long, and I gave up.

Finally, my questions are: Am I doing things right? What is the best way to
compute item similarity offline without MapReduce?

Thanks in advance!
Abmar

-- 
Abmar Barros
MSc candidate on Computer Science at Federal University of Campina Grande -
www.ufcg.edu.br
OurGrid Team Member - www.ourgrid.org
Paraíba - Brazil


Random Forest feature types

2011-07-12 Thread Don Pazel
From what I can see, the random forest implementation takes either numerical 
or categorical feature data.  That worked fine for me, until I tried to 
incorporate word or text features.  I liked the encoders used in SGD, but they 
don't seem to apply to random forests.  So, did I overlook something simple 
that would allow me to include word or text features?  If not, are there plans 
(assuming the core algorithm allows) to add these feature types to random 
forests in the future?


Thanks,
Don Pazel 

Re: combination of features worsen the performance

2011-07-12 Thread Weihua Zhu
Hi Ted,

 Thanks very much for your very detailed reply. It is very helpful.
 I still have some questions; I hope I am not polluting this mailing
 list too much.
 
I understand all your comments except below:
 Finally, you should be combining group ranking objective as well as
 regression objectives.  Otherwise, your model will simply be learning which
 users are likely to click on anything and those users who will never click
 on anything.  There are provisions for segmented AUC in the code, but that
 will only work for binary targets.  In general, it is common to build
 cascaded models to deal with this.  The first model learns to predict click
 and the cascaded model learns conversion conditional on click.

We can use binary targets; that shouldn't be a problem.
Could you say a little more about segmented AUC, and also about
cascaded models?
Do you have any reference papers/books/code samples/example projects
for recommendation?
I have the Mahout in Action book, but I didn't see anything like that
there.
Thanks again for your help.


-Weihua


On Jul 11, 2011, at 3:30 PM, Ted Dunning wrote:

 There are lots of problems with the problem as posed.  I am not surprised
 with poor results.
 
 You should not downsample negative examples so severely.  I would keep as
 many as 10-30 x as many positive examples you have.  Even then, I suspect
 you don't have enough data especially if you have already included data for
 all of your models.
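A side note on the downsampling point: if negatives are kept at rate r, a trained classifier's probabilities can be mapped back to the full-data scale by deflating the odds. This is the standard prior-correction result for case-control sampling, not something specific to Mahout:

```python
def correct_downsampled(p, r):
    """Map probability p from a model trained with negatives kept at
    rate r (0 < r <= 1) back to the original class balance.
    Downsampling negatives inflates the odds by 1/r, so multiply
    the odds by r and convert back to a probability."""
    odds = r * p / (1.0 - p)
    return odds / (1.0 + odds)
```

Equivalently, for logistic regression you can subtract ln(1/r) from the learned intercept.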
 
 Your Feature A is not useful unless you are putting all ad results together.
  Even then, you need to include more advertiser, campaign and ad specific
 features.
 
 The feature vector size of 10,000 is actually relatively small if you have
 any reasonable degree of sparsity in your user and ad features.  Unused
 features do not hurt learning.
 
 Finally, you should be combining group ranking objective as well as
 regression objectives.  Otherwise, your model will simply be learning which
 users are likely to click on anything and those users who will never click
 on anything.  There are provisions for segmented AUC in the code, but that
 will only work for binary targets.  In general, it is common to build
 cascaded models to deal with this.  The first model learns to predict click
 and the cascaded model learns conversion conditional on click.
 
 Most importantly, really, I would recommend that you experiment with model
 design using a system like R so that you can get fast turn-around on
 modeling efforts.
 
 On Mon, Jul 11, 2011 at 3:04 PM, Weihua Zhu w...@adconion.com wrote:
 
 Hi, thanks Ted.
 I understand that the training dataset size is small. The reason is
 that we have a very limited number of action-class events/instances.
 We also want each target class to have an equal number of
 events/instances.
 Feature A is the advertisement campaign ID, and Feature B is the set of
 behaviors the internet user has, for example, gender: male, country:
 US, etc.
 I set the size of the encoder to 1, which is very large.
 I used this setup for  OnlineLogisticRegressioN:
   olr = new OnlineLogisticRegression(3, FEATURES, new L1());
   olr.alpha(1).stepOffset(1000).lambda(3e-5).learningRate(3);
 
 Thanks.
 
 -wz
 
 
 On Jul 11, 2011, at 2:49 PM, Ted Dunning wrote:
 
 This is a tiny amount of data.  The regularization in Mahout's SGD
 implementation is probably not as effective as second order techniques
 for
 such tiny data.
 
 Btw... you didn't answer my questions about what kind of data feature A
 and
 B are.  I understand that you might be shy about this, but without that
 kind
 of information, I can't help you.
 
 (and add this additional question)
 
 What is the size of the encoded vector?
 
 On Mon, Jul 11, 2011 at 2:26 PM, Weihua Zhu w...@adconion.com wrote:
 
 The target class is whether a user clicks an ad (advertisement), buys
 through an ad, or neither; so 3 classes.
 Feature A is about the advertisement itself;
 Feature B is about the user's behaviors.
 Currently I'm only using features A and B.
 Total training data is 250 instances for each class.
 
 thanks..
 
 
 
 From: Ted Dunning [ted.dunn...@gmail.com]
 Sent: Monday, July 11, 2011 2:15 PM
 To: user@mahout.apache.org
 Subject: Re: combination of features worsen the performance
 
 Can you say a little bit about the data?
 
 What are features A and B?  What kind of data do they represent?
 
 How many other features are there?
 
 What is the target variable?  How many possible values does it have?
 
 How much training data do you have?
 
 What sort of training are you doing?
 
 
 
 On Mon, Jul 11, 2011 at 2:08 PM, Weihua Zhu w...@adconion.com wrote:
 
 Hi, Dear all,
 
 I am using Mahout logistic regression for classification.
 Interestingly, for features A and B individually, each has satisfactory
 performance, say 65% and 80%, but when I combine them together (using
 an encoder) the performance is around 72%. Shouldn't the combined
 performance be better? Any thoughts? Thanks a lot,
 
 
 -wz.
 
 
 
 



Re: combination of features worsen the performance

2011-07-12 Thread Weihua Zhu
Thanks. We are trying to get a larger dataset, probably over 2000
instances for each class.
What do you mean by the errors on performance estimates? The confusion
matrix?


On Jul 11, 2011, at 2:44 PM, Konstantin Shmakov wrote:

 It seems that the training data set is way too small. What are the
 errors on the performance estimates?
 
 --
 
 On Mon, Jul 11, 2011 at 2:26 PM, Weihua Zhu w...@adconion.com wrote:
 The target class is whether a user clicks an ad (advertisement), buys
 through an ad, or neither; so 3 classes.
 Feature A is about the advertisement itself;
 Feature B is about the user's behaviors.
 Currently I'm only using features A and B.
 Total training data is 250 instances for each class.
 
 thanks..
 
 
 
 From: Ted Dunning [ted.dunn...@gmail.com]
 Sent: Monday, July 11, 2011 2:15 PM
 To: user@mahout.apache.org
 Subject: Re: combination of features worsen the performance
 
 Can you say a little bit about the data?
 
 What are features A and B?  What kind of data do they represent?
 
 How many other features are there?
 
 What is the target variable?  How many possible values does it have?
 
 How much training data do you have?
 
 What sort of training are you doing?
 
 
 
 On Mon, Jul 11, 2011 at 2:08 PM, Weihua Zhu w...@adconion.com wrote:
 
 Hi, Dear all,
 
  I am using Mahout logistic regression for classification.
 Interestingly, for features A and B individually, each has satisfactory
 performance, say 65% and 80%, but when I combine them together (using
 an encoder) the performance is around 72%. Shouldn't the combined
 performance be better? Any thoughts? Thanks a lot,
 
 
 -wz.
 
 
 
 
 
 -- 
 ksh:



Re: Connection Pooling

2011-07-12 Thread Salil Apte
Oh yeah, at runtime I'm getting back a BasicDataSource object for my
DataSource. Is that correct?

On Tue, Jul 12, 2011 at 9:59 PM, Salil Apte sa...@offlinelabs.com wrote:
 So I started actually looking at performance today and it is pretty
 horrendous. I've got about 61,000 rows in my database, which I'm
 assuming isn't *that* many rows, but recommendations are taking over 20
 seconds. Is there some way to ensure pooling is turned on? What else is
 a big driver of performance? My tables are set up with a unique
 composite index on (user_id, item_id) pairs, so there cannot be two
 entries with the same user_id and item_id. I'm not sure where to go
 from here.

 Thanks for the help!

 On Tue, Jul 12, 2011 at 12:47 AM, Sean Owen sro...@gmail.com wrote:
 You can ignore it. It just doesn't know for sure you have a pool.
 I believe I have even removed this in a recent refactoring.

 On Tue, Jul 12, 2011 at 2:21 AM, Salil Apte sa...@offlinelabs.com wrote:

 So I keep getting this warning from either Mahout or the server (I'm
 guessing the former):

 WARNING: You are not using ConnectionPoolDataSource. Make sure your
 DataSource pools connections to the database itself, or database
 performance will be severely reduced.

 I'm not really sure why this is happening. I have the following
 resource in my webapp's context.xml file. Is there anything else I
 need to do to enable connection pooling with a JNDI resource?

 <Resource name="jdbc/offline-local" auth="Container"
 type="javax.sql.DataSource" username="root" password=""
 driverClassName="com.mysql.jdbc.Driver"
 url="jdbc:mysql://localhost:3306/offlinedevel?autoReconnect=true&amp;cachePreparedStatements=true&amp;cachePrepStmts=true&amp;cacheResultSetMetadata=true&amp;alwaysSendSetIsolation=false&amp;elideSetAutoCommits=true"
 validationQuery="select 1" maxActive="16" maxIdle="4"
 removeAbandoned="true" logAbandoned="true" />

 Thanks in advance.

 -Salil