Re: Dictionary file format in Lucene-Mahout integration

2013-06-05 Thread Suneel Marthi
Stuti, seq2sparse is not the right tool when the input is a Lucene index; given 
that input, you would have to use lucene.vector instead.





From: Stuti Awasthi stutiawas...@hcl.com
To: user@mahout.apache.org user@mahout.apache.org; James Forth 
jjamesfo...@yahoo.com 
Sent: Wednesday, June 5, 2013 5:30 AM
Subject: RE: Dictionary file format in Lucene-Mahout integration
 

Hi James,
The seq2sparse job generates the dictionary in SequenceFile format, with the key 
as Text and the value as IntWritable. You might need to generate your dictionary 
file in this format.
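For reference, a minimal sketch of writing a dictionary in that layout (a SequenceFile of Text term to IntWritable index), against the Hadoop 1.x-era API that Mahout 0.7 builds on; the file name and term list here are made up for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: emit a term dictionary in the layout seq2sparse produces --
// a SequenceFile mapping Text (term) -> IntWritable (term index).
public class DictionaryWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("dictionary.file-0");      // example output path
        String[] terms = {"apache", "lucene", "mahout"}; // example terms

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, IntWritable.class);
        try {
            for (int i = 0; i < terms.length; i++) {
                writer.append(new Text(terms[i]), new IntWritable(i));
            }
        } finally {
            writer.close();
        }
    }
}
```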

Thanks
Stuti

-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] 
Sent: Wednesday, June 05, 2013 9:55 AM
To: user@mahout.apache.org; James Forth
Subject: Re: Dictionary file format in Lucene-Mahout integration

I've never used lucene.vector myself, so I'm thinking out loud here. Assuming 
that dict.out is in text format, you could use 'seqdirectory' to convert the 
dictionary to SequenceFile format.

That can then be fed into cvb.





From: James Forth jjamesfo...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org 
Sent: Tuesday, June 4, 2013 8:00 PM
Subject: Dictionary file format in Lucene-Mahout integration


Hello,


I'm wondering if anyone can help with a question about the dictionary format in
lucene.vector-to-cvb integration. I've previously used the pathway from text
files (seqdirectory -> seq2sparse -> rowid -> cvb) and it works fine. The
dictionary created by seq2sparse is in SequenceFile format, and this is
accepted by cvb.

But when using a pathway from a Lucene index (lucene.vector -> cvb), there is a
problem: cvb throws the error "dict.out not a SequenceFile".
lucene.vector appears to generate its dictionary in plain-text format, but cvb
requires it in SequenceFile format.

Does anyone know how to use lucene.vector with cvb, which I assume means
obtaining a dictionary as a SequenceFile from lucene.vector?

Thanks for your help.

James





Database connection pooling for a recommendation engine

2013-06-05 Thread Mike W.
Hello,

I am considering implementing a recommendation engine for a small website. The
website will use the LAMP stack, and for various reasons the recommendation
engine must be written in C++. It consists of an On-line Component and an
Off-line Component, both of which need to connect to MySQL. The difference is
that the On-line Component needs a connection pool, whereas a few persistent
connections, or even connecting on demand, would be sufficient for the Off-line
Component, since it does not face the real-time, concurrent-request load that
the On-line Component does.

The On-line Component is to be wrapped as a web service via Apache Axis2. The
PHP frontend app on the Apache HTTP server retrieves recommendation data from
this web service module.

There are two DB connection options for the On-line Component that I can think of:
1. Use an ODBC connection pool; unixODBC might be a candidate.
2. Use the connection-pool APIs that come as part of the Apache HTTP server;
mod_dbd would be a choice: http://httpd.apache.org/docs/2.2/mod/mod_dbd.html

As for the Off-line Component, a simple option is a direct connection using ODBC.

Due to my lack of web-app design experience, I have the following questions:

Option 1 for the On-line Component is a tightly coupled design that does not
take advantage of the pooling APIs in the Apache HTTP server. But if I choose
Option 2 (a 3-tier architecture), how do I use those connection-pool APIs from
a standalone component running apart from the Apache HTTP server?

A Java application can be deployed as a WAR file in a servlet container such
as Tomcat (see Mahout in Action, section 5.5), or it can
use org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource
(
https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation).
Is there any similar approach for my C++ recommendation engine?
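For what it's worth, the core of a fixed-size pool is small in any language. Here is a minimal, illustrative Java sketch (the class name is hypothetical, and strings stand in for real MySQL handles); the same structure carries over to C++ with a mutex/condition variable or a bounded queue:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal fixed-size pool: connections are created up front and handed out
// from a blocking queue; release() returns them for reuse.
public class SimplePool<C> {
    private final BlockingQueue<C> idle;

    public SimplePool(List<C> connections) {
        // FIFO queue pre-filled with the ready connections.
        this.idle = new ArrayBlockingQueue<>(connections.size(), false, connections);
    }

    // Blocks up to timeoutMs for a free connection; returns null on timeout.
    public C acquire(long timeoutMs) throws InterruptedException {
        return idle.poll(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public void release(C conn) {
        idle.offer(conn);
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in "connections" -- in practice these would be DB handles.
        SimplePool<String> pool = new SimplePool<>(Arrays.asList("conn-1", "conn-2"));
        String c1 = pool.acquire(100);
        String c2 = pool.acquire(100);
        String c3 = pool.acquire(50); // pool exhausted: times out, returns null
        System.out.println(c1 + " " + c2 + " " + (c3 == null));
        pool.release(c1);             // returned connection is reusable
        System.out.println(pool.acquire(100));
    }
}
```

In C++ the equivalent would replace the BlockingQueue with a std::queue guarded by a std::mutex and std::condition_variable.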

I am not sure whether my prototype design is sound. Any suggestions would be
appreciated. :)

Thanks,

Mike


Re: Database connection pooling for a recommendation engine

2013-06-05 Thread Sean Owen
Not sure, is this really related to Mahout?

I don't know of an equivalent of J2EE / Tomcat for C++, but there must
be something.

As a general principle, you will have to load your data into memory if
you want to perform the computations on the fly in real time. So how
you access the data isn't so important, since you will be reading it
all at once anyway.



Re: Database connection pooling for a recommendation engine

2013-06-05 Thread Manuel Blechschmidt
Hi Mike,
the following paper contains some comparisons between different database stacks.

I can also give you the QtSQL code if you are interested in it.

http://www.manuel-blechschmidt.de/data/MMRPG2.pdf

/Manuel


-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B



Re: Dictionary file format in Lucene-Mahout integration

2013-06-05 Thread Grant Ingersoll
{code}
File dictOutFile = new File(dictOut);
log.info("Dictionary Output file: {}", dictOutFile);
Writer writer = Files.newWriter(dictOutFile, Charsets.UTF_8);
DelimitedTermInfoWriter tiWriter = new DelimitedTermInfoWriter(writer, delimiter, field);
try {
  tiWriter.write(termInfo);
} finally {
  Closeables.closeQuietly(tiWriter);
}
{code}

This code in the Lucene driver class is the culprit. The way to fix it would be 
to abstract the writer and allow it to use other implementations, namely one 
that supports the seq2sparse SequenceFile format.
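A hedged sketch of that abstraction (the interface and class names here are hypothetical, not an actual patch): the driver would write terms through an interface, with one implementation emitting the Text/IntWritable SequenceFile layout that cvb expects.

```java
import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical abstraction: the Lucene driver would write through this
// interface instead of a hard-coded delimited-text writer.
interface DictionaryOutput extends Closeable {
    void write(String term, int index) throws IOException;
}

// Implementation producing the SequenceFile layout cvb accepts:
// Text key = term, IntWritable value = term index.
class SequenceFileDictionaryOutput implements DictionaryOutput {
    private final SequenceFile.Writer writer;

    SequenceFileDictionaryOutput(Configuration conf, Path path) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        this.writer = SequenceFile.createWriter(fs, conf, path,
                                                Text.class, IntWritable.class);
    }

    @Override
    public void write(String term, int index) throws IOException {
        writer.append(new Text(term), new IntWritable(index));
    }

    @Override
    public void close() throws IOException {
        writer.close();
    }
}
```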

Any chance you are up for patching it James?

-Grant



Grant Ingersoll | @gsingers
http://www.lucidworks.com







Why are clustering emails not clustering similar stuff?

2013-06-05 Thread Jesvin Jose
I tried to cluster 1000 emails of one person using k-means, but the clusters
are not forming well. For example, if Facebook sends notifications about James
Doe and 5 other people, I get 5 clusters like:

:VL-858{n=7
Top Terms:
doe   =  10.066998481750488
james=  10.066998481750488

Why are the notifications for all 5 people not getting clustered together? I
used variants of the commands from Mahout in Action (Sean Owen et al.), as
follows:

Vectorizing uses lowercasing, stop-word removal, and a length filter:

bin/hadoop jar
/home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar
org.apache.mahout.driver.MahoutDriver seq2sparse -i mymail-seqfiles -o
mymail-vectors-bigram -ow  -a mia.clustering.ch10.MyAnalyzer -chunk 200 -wt
tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq

It's for 1000 emails, but I tried 100 clusters. If I try 50, I still get
similar results, but only half the number of emails end up in any cluster.
bin/hadoop jar
/home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar
org.apache.mahout.driver.MahoutDriver kmeans -i
mymail-vectors-bigram/tfidf-vectors -c mymail-initial-clusters -o
mymail-kmeans-clusters-from-bigrams -dm
org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 100 -x
20 -cl
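One possible explanation for the symptom above: under tf-idf, a person's name is rare across the corpus and so gets a large weight, while the shared notification boilerplate is common and gets a small one. Cosine distance then sees notifications about different people as nearly orthogonal, so each name gets its own cluster. An illustrative sketch with made-up weights (not actual Mahout output):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: two near-identical notification emails that differ just
// in the person's name. The name terms carry large tf-idf weights (rare
// corpus-wide) and dominate the cosine similarity. Weights are invented.
public class CosineDemo {
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            na += e.getValue() * e.getValue();
        }
        for (double w : b.values()) nb += w * w;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Double> mailAboutDoe = new HashMap<>();
        mailAboutDoe.put("facebook", 1.0);     // common boilerplate, low weight
        mailAboutDoe.put("notification", 1.0);
        mailAboutDoe.put("james", 10.0);       // rare name, high tf-idf weight
        mailAboutDoe.put("doe", 10.0);

        Map<String, Double> mailAboutRoe = new HashMap<>();
        mailAboutRoe.put("facebook", 1.0);
        mailAboutRoe.put("notification", 1.0);
        mailAboutRoe.put("jane", 10.0);
        mailAboutRoe.put("roe", 10.0);

        // The heavy name terms do not overlap, so the two mails look nearly
        // orthogonal despite sharing all their boilerplate.
        System.out.printf("%.3f%n", cosine(mailAboutDoe, mailAboutRoe));
    }
}
```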

-- 
We don't beat the reaper by living longer. We beat the reaper by living well
and living fully. The reaper will come for all of us. The question is, what do
we do between the time we are born and the time he shows up? -Randy Pausch