Re: Dictionary file format in Lucene-Mahout integration
Stuti, seq2sparse is not the right tool if the input is Lucene indexes; one would have to go with lucene.vector given that input.

From: Stuti Awasthi stutiawas...@hcl.com
To: user@mahout.apache.org; James Forth jjamesfo...@yahoo.com
Sent: Wednesday, June 5, 2013 5:30 AM
Subject: RE: Dictionary file format in Lucene-Mahout integration

Hi James,
The seq2sparse class generates the dictionary in SequenceFile format with the key as Text and the value as IntWritable. You might need to generate the dictionary file in this format.
Thanks
Stuti

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, June 05, 2013 9:55 AM
To: user@mahout.apache.org; James Forth
Subject: Re: Dictionary file format in Lucene-Mahout integration

Never used lucene.vector myself, thinking out loud here. Assuming that dict.out is in text format, you could use 'seqdirectory' to convert the dictionary to SequenceFile format. This can then be fed into cvb.

From: James Forth jjamesfo...@yahoo.com
To: user@mahout.apache.org
Sent: Tuesday, June 4, 2013 8:00 PM
Subject: Dictionary file format in Lucene-Mahout integration

Hello,
I'm wondering if anyone can help with a question about the dictionary format in lucene.vector-cvb integration. I've previously used the pathway from text files: seqdirectory -> seq2sparse -> rowid -> cvb, and it works fine. The dictionary created by seq2sparse is in SequenceFile format, and this is accepted by cvb. But when using a pathway from a Lucene index: lucene.vector -> cvb, there is a problem with cvb throwing the error "dict.out not a SequenceFile". lucene.vector appears to generate a dictionary in plain text format, but cvb requires it in SequenceFile format. Does anyone know how to use lucene.vector with cvb, which I assume means obtaining a dictionary as a sequence file from lucene.vector?
Thanks for your help.
James
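If neither seqdirectory nor a Mahout patch works out, a small one-off converter may be enough to bridge the gap. The sketch below is untested and assumes dict.out is a delimited text file whose first column is the term and whose last column is the integer term index (inspect the actual file layout, which may also carry a document-frequency column); it rewrites the dictionary as the SequenceFile with Text keys and IntWritable values that cvb expects, matching the format Stuti describes.

{code}
// Sketch only: convert a delimited text dictionary (e.g. dict.out from lucene.vector)
// into the SequenceFile<Text, IntWritable> layout that seq2sparse produces and cvb expects.
// The column layout of dict.out is an assumption here -- inspect the file and adjust.
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TextDictToSequenceFile {
  public static void main(String[] args) throws Exception {
    String textDict = args[0];   // e.g. dict.out
    String seqDict = args[1];    // e.g. dictionary.file-0
    String delimiter = "\t";     // assumption: tab-delimited

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, new Path(seqDict), Text.class, IntWritable.class);
    BufferedReader reader = new BufferedReader(new FileReader(textDict));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("#") || !line.contains(delimiter)) {
          continue; // skip header/count lines -- layout assumption, adjust as needed
        }
        String[] parts = line.split(delimiter);
        // Assumption: first column is the term, last column is the integer term index.
        writer.append(new Text(parts[0]),
            new IntWritable(Integer.parseInt(parts[parts.length - 1])));
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}
{code}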
Database connection pooling for a recommendation engine
Hello,

I am considering implementing a recommendation engine for a small website. The website will use the LAMP stack, and for certain reasons the recommendation engine must be written in C++. It consists of an on-line component and an off-line component, both of which need to connect to MySQL. The difference is that the on-line component will need a connection pool, whereas a few persistent connections, or even connecting on demand, would be sufficient for the off-line component, since it does not face the real-time, concurrent-request workload that the on-line component does. The on-line component is to be wrapped as a web service via Apache Axis2; the PHP frontend app on the Apache HTTP server retrieves recommendation data from this web service module.

There are two DB connection options for the on-line component that I can think of:
1. Use an ODBC connection pool; I think unixODBC might be a candidate.
2. Use the connection pool APIs that come as part of the Apache HTTP server; mod_dbd would be a choice: http://httpd.apache.org/docs/2.2/mod/mod_dbd.html

As for the off-line component, a simple option is a direct connection using ODBC.

Due to my lack of web app design experience, I have the following questions. Option 1 for the on-line component is a tightly coupled design that does not take advantage of the pooling APIs in the Apache HTTP server. But if I choose Option 2 (a 3-tiered architecture), how does a standalone component, separate from the Apache HTTP server, use its connection pool APIs? A Java application can be deployed as a WAR file and run in a servlet container such as Tomcat (see Mahout in Action, section 5.5), or it can use org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource (https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation). Is there any similar approach for my C++ recommendation engine?

I am not sure if I made a proper prototype. Any suggestions will be appreciated :)

Thanks,
Mike
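For context, the Java approach cited above looks roughly like the sketch below. This is a hedged illustration based on the Mahout Recommender Documentation and the 0.x Taste API (class and constructor details should be verified against the version in use); host, credentials, table and column names are placeholders. The open question in the thread is what the equivalent pooling layer would be for a C++/ODBC stack.

{code}
// Sketch of the Java/Taste analogue Mike references (Mahout 0.x) -- not the C++ answer.
// Connection details and the table/column names are placeholders.
import javax.sql.DataSource;

import org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource;
import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

public class PooledDataModelExample {
  public static DataModel build() throws Exception {
    MysqlDataSource mysql = new MysqlDataSource();
    mysql.setServerName("localhost");
    mysql.setDatabaseName("recommender");
    mysql.setUser("user");
    mysql.setPassword("password");

    // Wrap the plain DataSource so connections are pooled and reused across requests.
    DataSource pooled = new ConnectionPoolDataSource(mysql);

    // JDBC-backed preference data; any Taste recommender can be built on top of this model.
    return new MySQLJDBCDataModel(pooled,
        "taste_preferences", "user_id", "item_id", "preference", "timestamp");
  }
}
{code}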
Re: Database connection pooling for a recommendation engine
Not sure, is this really related to Mahout? I don't know of an equivalent of J2EE / Tomcat for C++, but there must be something. As a general principle, you will have to load your data into memory if you want to perform the computations on the fly in real time. So how you access the data isn't so important, because you will be reading it all at once.
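In Mahout's own Java stack, the point above about loading everything into memory maps onto something like ReloadFromJDBCDataModel, which pulls the whole JDBC-backed model into RAM up front. The sketch below is a hedged Java illustration of that principle (again the analogue, not the C++ answer), with class names from the 0.x Taste API that are worth double-checking.

{code}
// Sketch of the "load it all at once" principle using Mahout's Taste classes.
// ReloadFromJDBCDataModel reads the whole underlying JDBC model into memory once,
// so per-request recommendation calls never touch the database; pooling then matters
// mainly for the periodic reloads, not for each request.
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.JDBCDataModel;

public class InMemoryModelExample {
  public static DataModel loadIntoMemory(JDBCDataModel jdbcModel) throws TasteException {
    // One bulk read at startup (and on each refresh); afterwards lookups are in-memory.
    return new ReloadFromJDBCDataModel(jdbcModel);
  }
}
{code}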
Re: Database connection pooling for a recommendation engine
Hi Mike, the following paper contains some comparisons between different database stacks. I can also give you the QtSQL code if you are interested in it. http://www.manuel-blechschmidt.de/data/MMRPG2.pdf

/Manuel

--
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobile: 0173/6322621
Twitter: http://twitter.com/Manuel_B
Re: Dictionary file format in Lucene-Mahout integration
{code}
File dictOutFile = new File(dictOut);
log.info("Dictionary Output file: {}", dictOutFile);
Writer writer = Files.newWriter(dictOutFile, Charsets.UTF_8);
DelimitedTermInfoWriter tiWriter = new DelimitedTermInfoWriter(writer, delimiter, field);
try {
  tiWriter.write(termInfo);
} finally {
  Closeables.closeQuietly(tiWriter);
}
{code}

The code above is the culprit in the Lucene Driver class. The way to fix this would be to abstract the writer and allow it to use other implementations, namely one that supports the seq2sparse format. Any chance you are up for patching it, James?

-Grant

Grant Ingersoll | @gsingers
http://www.lucidworks.com
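A rough shape for such a patch might look like the sketch below: an alternative writer that emits the dictionary directly as a SequenceFile with Text keys and IntWritable values. This is only a sketch; it assumes TermInfo exposes an iterator of entries carrying the term string and its integer index (as DelimitedTermInfoWriter consumes), and the exact Mahout utils class and method names would need to be checked before turning it into a patch.

{code}
// Hypothetical SequenceFile-backed term-info writer along the lines suggested above -- a sketch only.
// The TermInfo/TermEntry accessors used here are assumptions to be verified against the Mahout version.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.utils.vectors.TermEntry;
import org.apache.mahout.utils.vectors.TermInfo;

public class SequenceFileTermInfoWriter {

  private final SequenceFile.Writer writer;

  public SequenceFileTermInfoWriter(Configuration conf, Path dictPath) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    this.writer = SequenceFile.createWriter(fs, conf, dictPath, Text.class, IntWritable.class);
  }

  // Write each term and its index as a <Text, IntWritable> pair, matching seq2sparse's dictionary.
  public void write(TermInfo termInfo) throws IOException {
    Iterator<TermEntry> entries = termInfo.getAllEntries();
    while (entries.hasNext()) {
      TermEntry entry = entries.next();
      writer.append(new Text(entry.getTerm()), new IntWritable(entry.getTermIdx()));
    }
  }

  public void close() throws IOException {
    writer.close();
  }
}
{code}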
Why are clustering emails not clustering similar stuff?
I tried to cluster 1000 emails of one person using k-means, but the clusters are not forming well. For example, if Facebook sends notifications about James Doe and 5 other people, I get 5 clusters like:

:VL-858{n=7
Top Terms:
doe   = 10.066998481750488
james = 10.066998481750488

Why are the notifications for all 5 people not getting clustered together?

I used variants of the commands from Mahout in Action (Sean Owen et al.) as follows. Vectorization uses lowercasing, stop words and a length filter:

bin/hadoop jar /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i mymail-seqfiles -o mymail-vectors-bigram -ow -a mia.clustering.ch10.MyAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq

It is 1000 emails, but I tried 100 clusters. If I try 50, I still get similar results, but only half the number of emails end up in any cluster at all:

bin/hadoop jar /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.driver.MahoutDriver kmeans -i mymail-vectors-bigram/tfidf-vectors -c mymail-initial-clusters -o mymail-kmeans-clusters-from-bigrams -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 100 -x 20 -cl