Re: forrest 1, benson 0

2010-01-13 Thread David Hall
I had this problem too, though I guess at the time I was running
normal Leopard, and not Snow Leopard.

This link might help?

http://chxor.chxo.com/post/183013153/installing-java-1-5-on-snow-leopard

-- David

On Wed, Jan 13, 2010 at 3:08 PM, Benson Margulies bimargul...@gmail.com wrote:
 forrest does not support Java 1.6. MacOSX does not support Java 1.5.



[jira] Commented: (MAHOUT-227) Parallel SVM

2009-12-20 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793053#action_12793053
 ] 

David Hall commented on MAHOUT-227:
---

As Ted hints, a proposal should really be placed on the wiki. 
http://cwiki.apache.org/MAHOUT/

Looking forward to it.

 Parallel SVM
 

 Key: MAHOUT-227
 URL: https://issues.apache.org/jira/browse/MAHOUT-227
 Project: Mahout
  Issue Type: Task
  Components: Classification
Reporter: zhao zhendong
 Attachments: svmProposal.patch


 I wrote a proposal for a parallel SVM training algorithm. Any comments are 
 welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: SVM algo, code, etc.

2009-12-15 Thread David Hall
On Fri, Dec 11, 2009 at 5:02 AM, Jake Mannix jake.man...@gmail.com wrote:
 I really feel like I should respond to this, but seeing as I live on the
 west coast
 of the US, going to bed might be more advisable.

 On a very specific topic of SVMs, I can certainly look into this, but David,
 were you interested in helping bring this into Mahout and help maintain it?
 You are often rather quiet on here, yet happened to jump in as this topic
 came up?

Yeah, the first semester of the PhD program has been far more busy
than I imagined, and I've been overwhelmed. (Now it's finals week.)

Online optimization has kind of caught my eye of late, with Pegasos
being something I had been thinking about implementing. I would be
glad to get this up and running, though what I'd like more is to help curate a
patch, Pegasos or no.

-- David


  -jake

 On Fri, Dec 11, 2009 at 4:40 AM, Sean Owen sro...@gmail.com wrote:

 This is a timely message, since I'm currently presuming to close some
 old Mahout issues at the moment and it raises a related concern.

 There's lots of old JIRA issues of the form:
 1) somebody submits a patch implementing part of something
 2) some comments happen, maybe
 3) nothing happens for a year
 4) I close it now

 At an early stage, this is fine actually. 20 people contribute at the
 start; 3 select themselves naturally as regular contributors. 20
 patches go up; the 5 that are of use and interest naturally get picked
 up and eventually committed. But going forward, this probably won't
 do. Potential committers get discouraged and work goes wasted. (See
 comments about Commons Math on this list for an example of the
 fallout.)

 I wonder what the obstacles are to avoiding this?

 1) Do we need to be clearer about what the project is and isn't about?
 What the priorities are, what work is already on the table to be done?
 This is why I am keen on cleaning up JIRA now; it's hard for even us
 to understand what's in progress and what's important.

 2) Do we need some more official ownership or responsibility for
 components? For example I am not sure who would manage changes to
 Clustering stuff. I know it isn't me; I don't know about that part. So
 what happens to an incoming patch to clustering? While too much
 command-and-control isn't possible or desirable in open source, lack
 of it is harmful too. I don't think the answer is just let people
 commit bits and bobs since it makes the project appear to be a
 workbench of half-finished jobs, which does a disservice to the
 components that are polished.


 I have no reason to believe this SVM patch, should it materialize,
 would fall through the cracks in this way, but want to ask now how we
 can just make sure. So, can we answer:

 1) Is SVM in scope for Mahout? (I am guessing so.)
 2) Who is nominally committing to shepherd the code into the code base
 and fix bugs and answer questions? (Jake?)


 I'm not really bothered about this particular patch, but the more
 general question.




[jira] Assigned: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper

2009-12-06 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall reassigned MAHOUT-197:
-

Assignee: David Hall

 LDADriver: No job jar file set leads to ClassNotFoundException: 
 org.apache.mahout.clustering.lda.LDAMapper
 --

 Key: MAHOUT-197
 URL: https://issues.apache.org/jira/browse/MAHOUT-197
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.2
 Environment: ubuntu 9.04, sun jdk 1.6.0_07, hadoop cluster running 
 0.20.1, build from r834311 of 
 http://svn.apache.org/repos/asf/lucene/mahout/trunk
Reporter: Drew Farris
Assignee: David Hall
Priority: Minor
 Attachments: LDADriver-setJar.patch


 hadoop jar 
 core/target/mahout-core-0.2-SNAPSHOT.joborg.apache.mahout.clustering.lda.LDADriver
  -i mahout/foo/foo-vectors -o mahout/foo/lda-cluster -w -k 1000 -v 82342 
 --maxIter 2
 [...]
 09/11/09 22:02:00 WARN mapred.JobClient: No job jar file set.  User
 classes may not be found. See JobConf(Class) or
 JobConf#setJar(String).
 [...]
 09/11/09 22:02:00 INFO input.FileInputFormat: Total input paths to process : 1
 09/11/09 22:02:01 INFO mapred.JobClient: Running job: job_200911091316_0005
 09/11/09 22:02:02 INFO mapred.JobClient:  map 0% reduce 0%
 09/11/09 22:02:12 INFO mapred.JobClient: Task Id :
 attempt_200911091316_0005_m_00_0, Status : FAILED
 java.lang.RuntimeException: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:808)
at 
 org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:532)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 Can be fixed by adding the following line to LDADriver after line 299 in 
 r831743:
 job.setJarByClass(LDADriver.class);
 (will attach trivial patch)
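
For context, a minimal, hypothetical sketch of where that one line sits in a 0.20-style driver; only the setJarByClass call is the actual fix, everything else is placeholder setup rather than the real LDADriver code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.clustering.lda.LDADriver;
import org.apache.mahout.clustering.lda.LDAMapper;

// Hypothetical driver fragment; only setJarByClass is the actual fix.
public class LdaJobSetupSketch {
  public static void runSketch(String input, String output) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "LDA");
    // The fix: ship the jar containing LDADriver/LDAMapper with the job, so
    // TaskTrackers can load org.apache.mahout.clustering.lda.LDAMapper.
    job.setJarByClass(LDADriver.class);
    job.setMapperClass(LDAMapper.class);
    FileInputFormat.addInputPath(job, new Path(input));
    FileOutputFormat.setOutputPath(job, new Path(output));
    job.waitForCompletion(true);
  }
}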

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper

2009-12-06 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-197:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Fixed in 887843 

Thanks for the patch!

 LDADriver: No job jar file set leads to ClassNotFoundException: 
 org.apache.mahout.clustering.lda.LDAMapper
 --

 Key: MAHOUT-197
 URL: https://issues.apache.org/jira/browse/MAHOUT-197
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.2
 Environment: ubuntu 9.04, sun jdk 1.6.0_07, hadoop cluster running 
 0.20.1, build from r834311 of 
 http://svn.apache.org/repos/asf/lucene/mahout/trunk
Reporter: Drew Farris
Assignee: David Hall
Priority: Minor
 Attachments: LDADriver-setJar.patch


 hadoop jar 
 core/target/mahout-core-0.2-SNAPSHOT.joborg.apache.mahout.clustering.lda.LDADriver
  -i mahout/foo/foo-vectors -o mahout/foo/lda-cluster -w -k 1000 -v 82342 
 --maxIter 2
 [...]
 09/11/09 22:02:00 WARN mapred.JobClient: No job jar file set.  User
 classes may not be found. See JobConf(Class) or
 JobConf#setJar(String).
 [...]
 09/11/09 22:02:00 INFO input.FileInputFormat: Total input paths to process : 1
 09/11/09 22:02:01 INFO mapred.JobClient: Running job: job_200911091316_0005
 09/11/09 22:02:02 INFO mapred.JobClient:  map 0% reduce 0%
 09/11/09 22:02:12 INFO mapred.JobClient: Task Id :
 attempt_200911091316_0005_m_00_0, Status : FAILED
 java.lang.RuntimeException: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:808)
at 
 org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:532)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 Can be fixed by adding the following line to LDADriver after line 299 in 
 r831743:
 job.setJarByClass(LDADriver.class);
 (will attach trivial patch)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: SVM algo, code, etc.

2009-12-03 Thread David Hall
On Wed, Nov 25, 2009 at 2:35 AM, Isabel Drost isa...@apache.org wrote:
 On Fri Grant Ingersoll gsing...@apache.org wrote:
 On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
  Post a patch if you'd like to proceed, IMHO.
 +1

 +1 from me as well. I would love to see solid svm support in Mahout.

And another +1 from me. If you want a pointer, I've recently stumbled
on a new solver for SVMs that seems to be remarkably easy to
implement.

It's called Pegasos:

ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf

-- David
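
For reference, the core of Pegasos (from the paper above) is a stochastic sub-gradient step on the primal objective. Roughly, with regularization parameter \lambda, step size \eta_t = 1/(\lambda t), and a randomly drawn example (x_i, y_i) with y_i in {-1, +1}:

    w_{t+1} = (1 - \eta_t \lambda) w_t + [y_i <w_t, x_i> < 1] \eta_t y_i x_i

optionally followed by projecting w_{t+1} back onto the ball of radius 1/\sqrt{\lambda}. The bracketed term is 1 when the margin constraint is violated and 0 otherwise.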


Re: SVM algo, code, etc.

2009-12-03 Thread David Hall
On Thu, Dec 3, 2009 at 2:12 AM, Olivier Grisel olivier.gri...@ensta.org wrote:
 2009/12/3 Ted Dunning ted.dunn...@gmail.com:
 Very interesting results, particularly the lack of dependence on data size.

 On Thu, Dec 3, 2009 at 12:02 AM, David Hall d...@cs.berkeley.edu wrote:

 On Wed, Nov 25, 2009 at 2:35 AM, Isabel Drost isa...@apache.org wrote:
  On Fri Grant Ingersoll gsing...@apache.org wrote:
  On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
   Post a patch if you'd like to proceed, IMHO.
  +1
 
  +1 from me as well. I would love to see solid svm support in Mahout.

 And another +1 from me. If you want a pointer, I've recently stumbled
 on a new solver for SVMs that seems to be remarkably easy to
 implement.

 It's called Pegasos:

 ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf

 Pegasos and other online implementations of SVMs based on regularized
 variants of stochastic gradient descent are indeed amenable to large
 scale problems. They solve the SVM optimization problem with a
 stochastic approximation of the primal (as opposed to more 'classical'
 solvers such as libsvm that solve the dual problem using Sequential
 Minimal Optimization). However, SGD-based SVM implementations are
 currently limited to the linear 'kernel' (which is often expressive
 enough for common NLP tasks such as document categorization).

You seem to know far more about this than I do, but the paper I linked
to specifically says (page 6) that they can use Mercer kernels.

-- David
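
To make the above concrete, here is a rough, self-contained sketch of a linear-kernel Pegasos pass over dense double arrays, with labels in {-1, +1}. It is illustrative only (plain arrays rather than Mahout's Vector API, single examples rather than mini-batches), and the class and method names are made up.

import java.util.Random;

/** Illustrative linear Pegasos sketch; not Mahout code. */
public final class PegasosSketch {

  /** Runs a fixed number of stochastic updates and returns the weight vector. */
  public static double[] train(double[][] x, int[] y, double lambda, int iterations) {
    Random rnd = new Random(42);
    double[] w = new double[x[0].length];
    for (int t = 1; t <= iterations; t++) {
      int i = rnd.nextInt(x.length);            // pick a random example
      double eta = 1.0 / (lambda * t);          // step size 1/(lambda * t)
      double margin = y[i] * dot(w, x[i]);
      for (int d = 0; d < w.length; d++) {      // shrink: (1 - eta*lambda) * w
        w[d] *= (1.0 - eta * lambda);
      }
      if (margin < 1.0) {                       // hinge-loss sub-gradient step
        for (int d = 0; d < w.length; d++) {
          w[d] += eta * y[i] * x[i][d];
        }
      }
      double norm = Math.sqrt(dot(w, w));       // optional projection onto the
      double radius = 1.0 / Math.sqrt(lambda);  //   ball of radius 1/sqrt(lambda)
      if (norm > radius) {
        for (int d = 0; d < w.length; d++) {
          w[d] *= radius / norm;
        }
      }
    }
    return w;
  }

  private static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int d = 0; d < a.length; d++) {
      s += a[d] * b[d];
    }
    return s;
  }
}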


Re: LDA for multi label classification was: Mahout Book

2009-10-16 Thread David Hall
On Fri, Oct 16, 2009 at 4:08 AM, zhao zhendong zhaozhend...@gmail.com wrote:
 I have seen the implementation of L-LDA using Java,
 Stanford Topic Modeling Toolbox http://nlp.stanford.edu/software/tmt/
 Does anyone know whether they provide the source code or not?

I'm pretty sure it's Scala, no? It's definitely open source. Like I
said, however, this implementation is almost certainly Gibbs sampling
based, which has consequences for parallelization (or rather, the
Rao-Blackwellization does.)

-- David

 Thanks,
 Maxim
 On Fri, Oct 16, 2009 at 12:39 PM, David Hall d...@cs.berkeley.edu wrote:

 Sorry, this slipped out of my inbox and I just found it!

 On Thu, Oct 8, 2009 at 12:05 PM, Robin Anil robin.a...@gmail.com wrote:
  Posting to the dev list.
  Great paper, thanks! Looks like L-LDA could be used to create some
  interesting examples.

 Thanks!

  The paper shows L-LDA could be used to create a word-tag model for accurate
  tag(s) prediction given a document of words. I will finish reading and report
  how much work is needed to transform/build on top of the current LDA
  implementation to get L-LDA. Any thoughts?

 Umm, cool! In the paper we used Gibbs sampling to do the inference,
 and the implementation in Mahout uses variational inference (because
 it distributes better). I don't see any obvious problems in terms of
 math, and so the rest is just fitting it in the system.

 I think a small amount of refactoring would be in order to make things
 more generic, and then it shouldn't be too hard to plug in. I'll add
 it to my list, but I'm swamped for quite some time.

 -- David

  Robin
  On Thu, Oct 8, 2009 at 11:50 PM, David Hall d...@cs.berkeley.edu
 wrote:
 
  The short answer is, that it probably won't help all that much. Naive
  Bayes is unreasonably good when you have enough data.
 
  The long answer is, I have a paper with Dan Ramage and Ramesh
  Nallapati that talks about how to do it.
 
  www.aclweb.org/anthology-new/D/D09/D09-1026.pdf
 
  In some sense, Labeled-LDA is a kind of Naive Bayes where you can
  have more than one class per document. If you have exactly one class
  per document, then LDA reduces to Naive Bayes (or the unsupervised
  variant of naive bayes which is basically k-means in multinomial
  space). If instead you wanted to project W words to K topics, with K <
  numWords, then there is something to do...
 
  That something is something like:
 
  1) get p(topic|word,document) for each word in each document (which is
  output by LDAInference). Those are your expected counts for each
  topic.
 
  2) For each class, do something like:
  p(topic|class) \propto  \sum_{document with that class,word}
  p(topic|word,document)
 
  Then just apply Bayes' rule to do classification:
 
  p(class|topics,document) \propto p(class) \prod p(topic|class,document)
 
  -- David
 
  On Thu, Oct 8, 2009 at 11:07 AM, Robin Anil robin.a...@gmail.com
 wrote:
   Thanks. Didn't see that, fixed it!
   I have a query
   How is the LDA topic model used to improve a classifier, say Naive Bayes?
   If it's possible, then I would like to integrate it into Mahout.
   Given m classes and the associated documents, one can build m topic models,
   right? (A set of topics (words) under each label and the associated
   probability distribution of words.)
   How can I use that info to weight the most relevant topic of a class?
  
  
 
   LDA has two meanings: linear discriminant analysis and latent
   dirichlet allocation. My code is the latter. The former is a kind of
   classification. You say linear discriminant analysis in the outline.
  
 
 
 




 --
 -

 Zhen-Dong Zhao (Maxim)

 

 Department of Computer Science
 School of Computing
 National University of Singapore


 Homepage:http://zhaozhendong.googlepages.com
 Mail: zhaozhend...@gmail.com
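
As a concrete reading of the recipe quoted above (expected topic counts from LDA inference, summed per class, then a Naive-Bayes-style decision rule), here is a rough sketch. All names are invented, the inputs are assumed to be per-document expected topic counts, and this is not Mahout code.

import java.util.List;

/** Illustrative sketch of the "Naive Bayes over LDA topics" recipe above; not Mahout code. */
public final class TopicNaiveBayesSketch {

  /** p(topic|class), estimated from per-document expected topic counts (e.g. LDA inference output). */
  public static double[][] estimateTopicGivenClass(List<double[]> docTopicCounts,
                                                   int[] docClass, int numClasses, int numTopics) {
    double[][] p = new double[numClasses][numTopics];
    for (int d = 0; d < docTopicCounts.size(); d++) {
      double[] counts = docTopicCounts.get(d);   // expected counts p(topic|word,doc) summed over words
      for (int k = 0; k < numTopics; k++) {
        p[docClass[d]][k] += counts[k];
      }
    }
    for (int c = 0; c < numClasses; c++) {       // normalize, with add-one smoothing
      double total = 0.0;
      for (int k = 0; k < numTopics; k++) { total += p[c][k] + 1.0; }
      for (int k = 0; k < numTopics; k++) { p[c][k] = (p[c][k] + 1.0) / total; }
    }
    return p;
  }

  /** argmax_c of log p(c) + sum_k count_k * log p(topic k | c). */
  public static int classify(double[] topicCounts, double[] classPrior, double[][] topicGivenClass) {
    int best = -1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int c = 0; c < classPrior.length; c++) {
      double score = Math.log(classPrior[c]);
      for (int k = 0; k < topicCounts.length; k++) {
        score += topicCounts[k] * Math.log(topicGivenClass[c][k]);
      }
      if (score > bestScore) { bestScore = score; best = c; }
    }
    return best;
  }
}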




Re: LDA for multi label classification was: Mahout Book

2009-10-15 Thread David Hall
Sorry, this slipped out of my inbox and I just found it!

On Thu, Oct 8, 2009 at 12:05 PM, Robin Anil robin.a...@gmail.com wrote:
 Posting to the dev list.
 Great paper, thanks! Looks like L-LDA could be used to create some
 interesting examples.

Thanks!

 The paper shows L-LDA could be used to create a word-tag model for accurate
 tag(s) prediction given a document of words. I will finish reading and report
 how much work is needed to transform/build on top of the current LDA
 implementation to get L-LDA. Any thoughts?

Umm, cool! In the paper we used Gibbs sampling to do the inference,
and the implementation in Mahout uses variational inference (because
it distributes better). I don't see any obvious problems in terms of
math, and so the rest is just fitting it in the system.

I think a small amount of refactoring would be in order to make things
more generic, and then it shouldn't be too hard to plug in. I'll add
it to my list, but I'm swamped for quite some time.

-- David

 Robin
 On Thu, Oct 8, 2009 at 11:50 PM, David Hall d...@cs.berkeley.edu wrote:

 The short answer is, that it probably won't help all that much. Naive
 Bayes is unreasonably good when you have enough data.

 The long answer is, I have a paper with Dan Ramage and Ramesh
 Nallapati that talks about how to do it.

 www.aclweb.org/anthology-new/D/D09/D09-1026.pdf

 In some sense, Labeled-LDA is a kind of Naive Bayes where you can
 have more than one class per document. If you have exactly one class
 per document, then LDA reduces to Naive Bayes (or the unsupervised
 variant of naive bayes which is basically k-means in multinomial
 space). If instead you wanted to project W words to K topics, with K <
 numWords, then there is something to do...

 That something is something like:

 1) get p(topic|word,document) for each word in each document (which is
 output by LDAInference). Those are your expected counts for each
 topic.

 2) For each class, do something like:
 p(topic|class) \propto  \sum_{document with that class,word}
 p(topic|word,document)

 Then just apply Bayes' rule to do classification:

 p(class|topics,document) \propto p(class) \prod p(topic|class,document)

 -- David

 On Thu, Oct 8, 2009 at 11:07 AM, Robin Anil robin.a...@gmail.com wrote:
  Thanks. Didn't see that, fixed it!
  I have a query
  How is the LDA topic model used to improve a classifier, say Naive Bayes?
  If it's possible, then I would like to integrate it into Mahout.
  Given m classes and the associated documents, one can build m topic models,
  right? (A set of topics (words) under each label and the associated
  probability distribution of words.)
  How can I use that info to weight the most relevant topic of a class?
 
 

  LDA has two meanings: linear discriminant analysis and latent
  dirichlet allocation. My code is the latter. The former is a kind of
  classification. You say linear discriminant analysis in the outline.
 





Re: 0.2

2009-09-28 Thread David Hall
2009/9/28 Grant Ingersoll gsing...@apache.org:

 On Sep 28, 2009, at 2:16 PM, Ted Dunning wrote:

 Many of these are actually nearly (completely) done.

 Is there a goal for the 0.2 release other than fixing outstanding issues?

 I'd like to see some of the performance issues around SparseVector taken care 
 of.  I think we also said we wanted to get the Random Forest and Bayes stuff 
 in that Robin and Deneche are working on.

 Beyond that, I plan on doing some profiling of the LDA stuff.  I'd say we are 
 pretty close.

From my memory, using hprof, it looks like most of the time is spent doing 
math.

(I haven't had a chance to try out YourKit, though.)

-- David




 On Mon, Sep 28, 2009 at 10:53 AM, Grant Ingersoll gsing...@apache.org wrote:

 Not too many open at this point:
 https://issues.apache.org/jira/secure/BrowseVersion.jspa?id=12310751&versionId=12313278&showOpenIssuesOnly=true

 Some are relatively minor, others are ready, but just need a final review.
 Can we push towards mid-October for a release?  Anyone volunteer to be the
 release mgr?

 -Grant




 --
 Ted Dunning, CTO
 DeepDyve

 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
 Solr/Lucene:
 http://www.lucidimagination.com/search




Re: 0.2 planning

2009-09-12 Thread David Hall
Any reason we haven't closed LDA yet
https://issues.apache.org/jira/browse/MAHOUT-123 ?

-- David

On Sat, Sep 12, 2009 at 11:45 AM, Grant Ingersoll gsing...@apache.org wrote:
 Here's the list of unresolved issues for 0.2:
 https://issues.apache.org/jira/secure/BrowseVersion.jspa?id=12310751&versionId=12313278&showOpenIssuesOnly=true

 Can we start to work towards whittling these down and getting 0.2 out?
  Maybe by mid-October?

 -Grant



[jira] Commented: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs

2009-09-12 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754588#action_12754588
 ] 

David Hall commented on MAHOUT-172:
---

Sorry, just noticed this issue!

Looks good to me.

-- David

 When running on a Hadoop cluster LDA fails with Caused by: 
 java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
 -

 Key: MAHOUT-172
 URL: https://issues.apache.org/jira/browse/MAHOUT-172
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.1
Reporter: Isabel Drost
 Fix For: 0.2

 Attachments: lda.patch


 I tried running the reuters example of lda on a hadoop cluster today. Seems 
 like the implementation tries to read all files in output/state-*, which fails 
 if _logs is found in that directory.
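
A common way to guard against this in Hadoop is to list the state directory through a PathFilter that skips bookkeeping entries such as _logs. A small illustrative sketch, not the attached lda.patch:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Illustrative sketch: skip Hadoop bookkeeping entries like _logs when
// reading a state directory. This is not the attached patch.
public final class StateDirListingSketch {
  private static final PathFilter NOT_HIDDEN = new PathFilter() {
    public boolean accept(Path path) {
      String name = path.getName();
      return !name.startsWith("_") && !name.startsWith(".");
    }
  };

  public static FileStatus[] listStateFiles(Path stateDir, Configuration conf) throws IOException {
    FileSystem fs = stateDir.getFileSystem(conf);
    return fs.listStatus(stateDir, NOT_HIDDEN);   // only real part files survive the filter
  }
}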

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-161) Add Vector.norm to compute k-norms of vectors

2009-08-07 Thread David Hall (JIRA)
Add Vector.norm to compute k-norms of vectors
-

 Key: MAHOUT-161
 URL: https://issues.apache.org/jira/browse/MAHOUT-161
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Affects Versions: 0.2
Reporter: David Hall
 Fix For: 0.2
 Attachments: MAHOUT-161

This patch adds Vector.norm(double power) to Vector (and an implementation to 
AbstractVector).

AbstractVector.normalize now calls norm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-161) Add Vector.norm to compute k-norms of vectors

2009-08-07 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-161:
--

Attachment: MAHOUT-161

 Add Vector.norm to compute k-norms of vectors
 -

 Key: MAHOUT-161
 URL: https://issues.apache.org/jira/browse/MAHOUT-161
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Affects Versions: 0.2
Reporter: David Hall
 Fix For: 0.2

 Attachments: MAHOUT-161


 This patch adds Vector.norm(double power) to Vector (and an implementation to 
 AbstractVector).
 AbstractVector.normalize now calls norm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-161) Add Vector.norm to compute k-norms of vectors

2009-08-07 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740821#action_12740821
 ] 

David Hall commented on MAHOUT-161:
---

normalize(power) returns this.divide(norm(power)).

norm(p) == ||v||_p,

normalize(p) == v / ||v||_p

norm(p) is useful for checking whether a vector is sufficiently close to 0, e.g. whether the 
difference between two phases of an optimization is small enough to declare convergence.
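
For illustration, the norm/normalize relationship described above, written out over a plain double array (this is only a sketch, not the attached patch):

/** Illustrative p-norm helpers, mirroring the norm/normalize relationship above; not the patch itself. */
public final class NormSketch {

  /** ||v||_p = (sum_i |v_i|^p)^(1/p) for p >= 1. */
  public static double norm(double[] v, double power) {
    double sum = 0.0;
    for (double x : v) {
      sum += Math.pow(Math.abs(x), power);
    }
    return Math.pow(sum, 1.0 / power);
  }

  /** normalize(p) == v / ||v||_p, as described above. */
  public static double[] normalize(double[] v, double power) {
    double n = norm(v, power);
    double[] result = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      result[i] = v[i] / n;
    }
    return result;
  }
}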

 Add Vector.norm to compute k-norms of vectors
 -

 Key: MAHOUT-161
 URL: https://issues.apache.org/jira/browse/MAHOUT-161
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Affects Versions: 0.2
Reporter: David Hall
 Fix For: 0.2

 Attachments: MAHOUT-161


 This patch adds Vector.norm(double power) to Vector (and an implementation to 
 AbstractVector).
 AbstractVector.normalize now calls norm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-08-07 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-123:
--

Attachment: MAHOUT-123.patch

The problem was that the edited patch wrote topics to .../examples/ and not 
../examples , which took a frustratingly long time to figure out.

I also played with the parameters a little longer. The topics aren't as great 
as I'd like, but it's because I haven't figured out the right setting for 
getting rid of stop words. could and said are still in there. That said, 
they're mostly coherent topics, if kind of boring.

-- David

 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collection of documents and condense it down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topic models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for parallelizing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processor gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P., and Welling, M. Distributed
 Inference for Latent Dirichlet Allocation. NIPS, 2007.
 Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
 continuous-time model of topical trends. KDD, 2006
 Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
 large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-08-03 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-123:
--

Attachment: MAHOUT-123.patch

Patch fixed for Yanen's problem. Apparently I messed up the dependencies 
somehow so that they'd work on my machine, but not anywhere else. Now I think 
it's ok. (I nuked my Maven repo.)

-- David


 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch, MAHOUT-123.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collection of documents and condense it down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topic models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for parallelizing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processor gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P., and Welling, M. Distributed
 Inference for Latent Dirichlet Allocation. NIPS, 2007.
 Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
 continuous-time model of topical trends. KDD, 2006
 Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
 large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-08-02 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-123:
--

Attachment: MAHOUT-123.patch

Ok, core/bin/build-reuters will:

download reuters to work/reuters(something or another), untar it, build an 
index with it using lucene, convert said index into vectors, run lda for 40 
iterations (which is close enough to convergence) to work/lda, and then dump 
the top 100 words for each topic into work/topics/topic-K, where K is the 
topic of interest.
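
The last step, dumping the top words per topic, is essentially a top-k selection over each topic's word scores. A generic sketch with invented inputs (a per-topic score array plus a word dictionary); this is not the actual build-reuters or dump code:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative top-k selection for one topic; not the actual build-reuters dump code. */
public final class TopWordsSketch {

  /** Returns the k highest-scoring words for a topic, best first; assumes k >= 1. */
  public static List<String> topWords(final double[] wordScores, String[] dictionary, int k) {
    // min-heap of word ids keyed by score; the weakest of the current top k sits on top
    PriorityQueue<Integer> heap = new PriorityQueue<Integer>(k, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        return Double.compare(wordScores[a], wordScores[b]);
      }
    });
    for (int w = 0; w < wordScores.length; w++) {
      heap.offer(w);
      if (heap.size() > k) {
        heap.poll();                        // drop the weakest candidate
      }
    }
    List<String> result = new ArrayList<String>();
    while (!heap.isEmpty()) {
      result.add(dictionary[heap.poll()]);  // comes out weakest first
    }
    Collections.reverse(result);            // best first
    return result;
  }
}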

 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collection of documents and condense it down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topic models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for parallelizing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processor gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P., and Welling, M. Distributed
 Inference for Latent Dirichlet Allocation. NIPS, 2007.
 Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
 continuous-time model of topical trends. KDD, 2006
 Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
 large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-08-02 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738165#action_12738165
 ] 

David Hall commented on MAHOUT-123:
---

I unfortunately haven't run it on a Hadoop cluster yet. It should just work 
if you run it with the right Hadoop configuration. Shouldn't running it through 
the hadoop shell script add the configuration?

I'll get it running on a hadoop cluster soon.

The code actually requires Hadoop 0.20, because Mahout has decided to move in 
that direction.

-- David

 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collection of documents and condense it down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topic models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for parallelizing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processor gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P., and Welling, M. Distributed
 Inference for Latent Dirichlet Allocation. NIPS, 2007.
 Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
 continuous-time model of topical trends. KDD, 2006
 Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
 large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-07-28 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736238#action_12736238
 ] 

David Hall commented on MAHOUT-123:
---

So it looks like the way Lucene does it is w/ an ant task. I can't figure out 
the maven way to do this, without my building some kind of jar from it. I'm 
happy to do it, but I'm not sure what the proper way to do this is.

Thoughts?

-- David

 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collection of documents and condense it down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topic models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for parallelizing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processor gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P., and Welling, M. Distributed
 Inference for Latent Dirichlet Allocation. NIPS, 2007.
 Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
 continuous-time model of topical trends. KDD, 2006
 Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
 large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-07-27 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735705#action_12735705
 ] 

David Hall commented on MAHOUT-123:
---

Ok, I'll add more comments. 

I had been using Reuters 21578, but I'm not convinced that it's ok to include 
it, and I was looking around for something better. I'll get the download 
automated for wikipedia chunks. Is a shell script ok to do most of it?

-- David

 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collection of documents and condense it down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topic models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for parallelizing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processor gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P., and Welling, M. Distributed
 Inference for Latent Dirichlet Allocation. NIPS, 2007.
 Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
 continuous-time model of topical trends. KDD, 2006
 Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
 large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Hadoop 0.20 and DummyOutputCollector

2009-07-22 Thread David Hall
The Hadoop gods have seen fit to deprecate OutputCollector and replace
it with a non-static inner class of Mapper called Context. This
complicates several tests, namely:

./src/test/java/org/apache/mahout/classifier/bayes/BayesFeatureMapperTest.java
./src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java
./src/test/java/org/apache/mahout/clustering/dirichlet/TestMapReduce.java
./src/test/java/org/apache/mahout/clustering/fuzzykmeans/TestFuzzyKmeansClustering.java
./src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java
./src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java
./src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java
./src/test/java/org/apache/mahout/ga/watchmaker/EvalMapperTest.java

and my new LDA test.

These won't work in the new API. As far as I can tell, the only ways
to fix this are to:
1) factor out a method that is testable, but this makes the code less idiomatic
2) abuse reflection to create a Mapper.Context (not sure about this,
but I imagine it's doable), and supply a utility method for this
3) individually override each Mapper class in each test and include
similar logic to get it done.

Thoughts?

-- David
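
To make option 1 concrete, a hypothetical sketch of a 0.20-style Mapper with its per-record logic factored into a plain method that a test can call directly, so the test never needs to construct a Mapper.Context. The mapper itself is invented, not any particular Mahout class:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Hypothetical mapper with the per-record logic factored out (option 1 above). */
public class WordLengthMapper extends Mapper<Object, Text, Text, IntWritable> {

  /** Pure logic: easy to call from a unit test, no Context needed. */
  static List<String> tokens(String line) {
    List<String> out = new ArrayList<String>();
    for (String t : line.split("\\s+")) {
      if (!t.isEmpty()) {
        out.add(t);
      }
    }
    return out;
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Thin wrapper: the Context is only touched here, never in the logic under test.
    for (String token : tokens(value.toString())) {
      context.write(new Text(token), new IntWritable(token.length()));
    }
  }
}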


Re: Hadoop 0.20 and DummyOutputCollector

2009-07-22 Thread David Hall
Sorry, I don't think that's gonna work for tests, unless I misunderstood you.

Here's the old interface:

 public void map(K key, V val, OutputCollector<K, V> output, Reporter
reporter) throws IOException

And the new:

 public void map(K key, V value, Context context) throws IOException

The tests historically have called map(whatever,whatever2,
dummyCollector, null);

There is no obvious analog to a dummyCollector, precisely because
Context is a *non-static* class.

-- David

On Wed, Jul 22, 2009 at 1:49 PM, Sean Owen sro...@gmail.com wrote:
 Copy-n-paste the old or new code and continue to use it?

 On Wed, Jul 22, 2009 at 9:26 PM, David Hall d...@cs.stanford.edu wrote:
 The Hadoop gods have seen fit to deprecate OutputCollector and replace
 it with a non-static inner class of Mapper called Context. This
 complicates several tests, namely:



Re: Hadoop 0.20 and DummyOutputCollector

2009-07-22 Thread David Hall
It's a non-static inner class; you can't construct those outside of
the enclosing class, can you? Or is my Java that rusty?

-- David

On Wed, Jul 22, 2009 at 2:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 I don't quite understand how this complicates testing but leaves this code
 running.

 Why can these classes not construct a mock Context?

 On Wed, Jul 22, 2009 at 1:26 PM, David Hall d...@cs.stanford.edu wrote:

 These won't work in the new API. As far as I can tell, the only ways
 to fix this are to:
 1) factor out a method that is testable, but this makes the code less
 idiomatic
 2) abuse reflection to create a Mapper.Context (not sure about this,
 but I imagine it's doable), and supply a utility method for this
 3) individually override each Mapper class in each test and include
 similar logic to get it done.




 --
 Ted Dunning, CTO
 DeepDyve



Re: Hadoop 0.20 and DummyOutputCollector

2009-07-22 Thread David Hall
Mapper.Context extends MapContext. The interface takes
Mapper.Context, and IIRC method arguments aren't contravariant in
Java.

-- David

On Wed, Jul 22, 2009 at 2:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Got it.

 Sorry to be dense.

 Is it really necessary to abuse reflection?  What about this:

 http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/MapContext.html

 ?

 On Wed, Jul 22, 2009 at 2:15 PM, David Hall d...@cs.stanford.edu wrote:

 There is no obvious analog to a dummyCollector, precisely because
 Context is a *non-static* class.




 --
 Ted Dunning, CTO
 DeepDyve



Re: Hadoop 0.20 and DummyOutputCollector

2009-07-22 Thread David Hall
Oops, they are contravariant. Ok, so my proposed best practice is to
make Mappers take a MapContext and not a Mapper.Context, and I'll
cook up a DummyContext real fast.

-- David

On Wed, Jul 22, 2009 at 2:22 PM, David Hall d...@cs.stanford.edu wrote:
 Mapper.Context extends MapContext. The interface takes
 Mapper.Context, and IIRC method arguments aren't contravariant in
 Java.

 -- David

 On Wed, Jul 22, 2009 at 2:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Got it.

 Sorry to be dense.

 Is it really necessary to abuse reflection?  What about this:

 http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/MapContext.html

 ?

 On Wed, Jul 22, 2009 at 2:15 PM, David Hall d...@cs.stanford.edu wrote:

 There is no obvious analog to a dummyCollector, precisely because
 Context is a *non-static* class.




 --
 Ted Dunning, CTO
 DeepDyve




Re: Hadoop 0.20 and DummyOutputCollector

2009-07-22 Thread David Hall
Nope, I was right the first time; they're invariant. I forgot the
@Override annotation.

Also, you can apparently create an inner class by saying

MyObj.InnerClass ic = myObj.new InnerClass(args);

This doesn't really get us very far though, without reflection.

-- David



On Wed, Jul 22, 2009 at 2:25 PM, David Hall d...@cs.stanford.edu wrote:
 Oops, they are contravariant. Ok, so my proposed best practice is to
 make Mappers take a MapContext and not a Mapper.Context, and I'll
 cook up a DummyContext real fast.

 -- David

 On Wed, Jul 22, 2009 at 2:22 PM, David Hall d...@cs.stanford.edu wrote:
 Mapper.Context extends MapContext. The interface takes
 Mapper.Context, and IIRC method arguments aren't contravariant in
 Java.

 -- David

  On Wed, Jul 22, 2009 at 2:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Got it.

 Sorry to be dense.

 Is it really necessary to abuse reflection?  What about this:

 http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/MapContext.html

 ?

 On Wed, Jul 22, 2009 at 2:15 PM, David Hall d...@cs.stanford.edu wrote:

 There is no obvious analog to a dummyCollector, precisely because
 Context is a *non-static* class.




 --
 Ted Dunning, CTO
 DeepDyve





Re: Hadoop 0.20 and DummyOutputCollector

2009-07-22 Thread David Hall
as in JMock? I have 0 experience with JMock, but I'll look into it.

-- David

On Wed, Jul 22, 2009 at 2:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Can you mock the object?  (that counts as using reflection and more, but is
 approved)

 On Wed, Jul 22, 2009 at 2:33 PM, David Hall d...@cs.stanford.edu wrote:

 Also, you can apparently create an inner class by saying

 MyObj.InnerClass ic = myObj.new InnerClass(args);

 This doesn't really get us very far though, without reflection.




[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-07-22 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-123:
--

Attachment: MAHOUT-123.patch

Everything fixed except adding an example.

What's the best way to include data with Mahout? I've never had luck 
autogenerating data for LDA.

 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collections of documents and condense them down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topics models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for paralellzing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processors gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P.,  Welling, M. Distributed
 Inference for Latent Dirichlet Allocation. NIPS, 2007.
 Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
 continuous-time model of topical trends. KDD, 2006
 Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
 large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Hadoop 0.20 and DummyOutputCollector

2009-07-22 Thread David Hall
EasyMock for the win.

Thanks for the suggestion!

No response on hadoop list, but  the mock seems like a fine solution
to me. In the simplest case you can just make sure that the call to
write is called K times, for whatever K, and in the advanced case you
can actually capture the outputs.
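
For the archives, here's the rough shape of what I mean. This is a sketch
only, assuming EasyMock's class extension (cglib plus Objenesis) really can
instantiate the non-static Context; the mapper here is a made-up example,
not Mahout code:

import static org.easymock.classextension.EasyMock.createMock;
import static org.easymock.classextension.EasyMock.expectLastCall;
import static org.easymock.classextension.EasyMock.replay;
import static org.easymock.classextension.EasyMock.verify;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MockedContextSketch {

  // Made-up mapper under test: emits (value, 1) for every input record.
  static class OneCountMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, new IntWritable(1));
    }
  }

  @SuppressWarnings("unchecked")
  public static void main(String[] args) throws Exception {
    Mapper<Text, Text, Text, IntWritable>.Context context =
        (Mapper<Text, Text, Text, IntWritable>.Context) createMock(Mapper.Context.class);

    // Record phase: expect exactly one write of ("hello", 1).
    context.write(new Text("hello"), new IntWritable(1));
    expectLastCall().times(1);
    replay(context);

    new OneCountMapper().map(new Text("k"), new Text("hello"), context);

    verify(context); // fails unless write() was called as recorded
  }
}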

-- David

On Wed, Jul 22, 2009 at 4:05 PM, Ted Dunningted.dunn...@gmail.com wrote:
 Or EasyMock.

 These are amazing libraries that actually twiddle byte code in some cases to
 emulate classes that would otherwise not be constructable.

 On Wed, Jul 22, 2009 at 2:46 PM, David Hall d...@cs.stanford.edu wrote:

 as in JMock? I have 0 experience with JMock, but I'll look into it.

 -- David

 On Wed, Jul 22, 2009 at 2:42 PM, Ted Dunningted.dunn...@gmail.com wrote:
  Can you mock the object?  (that counts as using reflection and more, but
 is
  approved)
 
  On Wed, Jul 22, 2009 at 2:33 PM, David Hall d...@cs.stanford.edu
 wrote:
 
  Also, you can apparently create an inner class by saying
 
  MyObj.InnerClass ic = myObj.new InnerClass(args);
 
  This doesn't really get us very far though, without reflection.
 
 




 --
 Ted Dunning, CTO
 DeepDyve



Re: Hadoop 0.20 and DummyOutputCollector

2009-07-22 Thread David Hall
Oh, they did propose a soln, slated for inclusion in 0.21.0

http://issues.apache.org/jira/browse/hadoop-5518
http://www.cloudera.com/hadoop-mrunit

However, these use the deprecated APIs... I think EasyMock might be
better here.

-- David

On Wed, Jul 22, 2009 at 5:24 PM, David Halld...@cs.stanford.edu wrote:
 EasyMock for the win.

 Thanks for the suggestion!

 No response on hadoop list, but  the mock seems like a fine solution
 to me. In the simplest case you can just make sure that the call to
 write is called K times, for whatever K, and in the advanced case you
 can actually capture the outputs.

 -- David

 On Wed, Jul 22, 2009 at 4:05 PM, Ted Dunningted.dunn...@gmail.com wrote:
 Or EasyMock.

 These are amazing libraries that actually twiddle byte code in some cases to
 emulate classes that would otherwise not be constructable.

 On Wed, Jul 22, 2009 at 2:46 PM, David Hall d...@cs.stanford.edu wrote:

 as in JMock? I have 0 experience with JMock, but I'll look into it.

 -- David

 On Wed, Jul 22, 2009 at 2:42 PM, Ted Dunningted.dunn...@gmail.com wrote:
  Can you mock the object?  (that counts as using reflection and more, but
 is
  approved)
 
  On Wed, Jul 22, 2009 at 2:33 PM, David Hall d...@cs.stanford.edu
 wrote:
 
  Also, you can apparently create an inner class by saying
 
  MyObj.InnerClass ic = myObj.new InnerClass(args);
 
  This doesn't really get us very far though, without reflection.
 
 




 --
 Ted Dunning, CTO
 DeepDyve




Re: [jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-07-22 Thread David Hall
It's more as an example than a test. I autogenerate data for the
tests, which are there for sanity.

-- David

On Wed, Jul 22, 2009 at 6:26 PM, Ted Dunningted.dunn...@gmail.com wrote:
 IN a maven standard build, test data is often included under
 src/test/resources

 On Wed, Jul 22, 2009 at 5:23 PM, David Hall (JIRA) j...@apache.org wrote:

 What's the best way to include data with Mahout? I've never had luck
 autogenerating data for LDA.




 --
 Ted Dunning, CTO
 DeepDyve



[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-06-29 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-123:
--

Attachment: MAHOUT-123.patch

Ok, here's the updates for the vectors. I'll add a page to the wiki shortly.

As for testing, this is actually something I'd like some direction on. It's 
never been clear to me how to test the actual implementation of clustering 
algorithms in any meaningful way. Looking at the Dirichlet clusterer, all that 
it tests are that serialization works, that things aren't null, and that it 
outputs the right number of things. Serialization in this case doesn't seem 
terribly necessary since my models are just serialized Writables. So... I 
should just add some basic sanity checks?

-- David

 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, 
 MAHOUT-123.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collections of documents and condense them down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topics models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for paralellzing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processors gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P.,  Welling, M. Distributed
 Inference for Latent Dirichlet Allocation. NIPS, 2007.
 Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
 continuous-time model of topical trends. KDD, 2006
 Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
 large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

2009-06-19 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-126:
--

Attachment: MAHOUT-123.patch

Ok, I'm going to call this a mostly functional patch.

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-126-benson.patch, 
 MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
 MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken in two tasks. 
 1. Create lucene index of the input  plain-text documents 
 2. From the index, generate the document vectors (sparse) with weights as 
 TF-IDF values of the term. With lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

2009-06-19 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-126:
--

Attachment: (was: MAHOUT-123.patch)

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-126-benson.patch, 
 MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
 MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken in two tasks. 
 1. Create lucene index of the input  plain-text documents 
 2. From the index, generate the document vectors (sparse) with weights as 
 TF-IDF values of the term. With lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Updated: (MAHOUT-126) Prepare document vectors from the text

2009-06-19 Thread David Hall
Ignore this. Wrong issue.

On Fri, Jun 19, 2009 at 12:59 AM, David Hall (JIRA)j...@apache.org wrote:

     [ 
 https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
  ]

 David Hall updated MAHOUT-126:
 --

    Attachment: MAHOUT-123.patch

 Ok, I'm going to call this a mostly functional patch.

 Prepare document vectors from the text
 --

                 Key: MAHOUT-126
                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
             Project: Mahout
          Issue Type: New Feature
    Affects Versions: 0.2
            Reporter: Shashikant Kore
            Assignee: Grant Ingersoll
             Fix For: 0.2

         Attachments: mahout-126-benson.patch, 
 MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
 MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken in two tasks.
 1. Create lucene index of the input  plain-text documents
 2. From the index, generate the document vectors (sparse) with weights as 
 TF-IDF values of the term. With lucene index, this value can be calculated 
 very easily.
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-18 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721346#action_12721346
 ] 

David Hall commented on MAHOUT-126:
---

That's not the only time. This constructor clearly lets certain things slip 
through.

{code}
  public CachedTermInfo(IndexReader reader, String field, int minDf, int maxDfPercent) throws IOException {
    this.field = field;
    TermEnum te = reader.terms(new Term(field, ""));
    int count = 0;
    int numDocs = reader.numDocs();
    double percent = numDocs * maxDfPercent / 100.0;
    //Should we use a linked hash map so that we know terms are in order?
    termEntries = new LinkedHashMap<String, TermEntry>();
    do {
      Term term = te.term();
      if (term == null || term.field().equals(field) == false) {
        break;
      }
      int df = te.docFreq();
      if (df < minDf || df > percent) {
        continue;
      }
      TermEntry entry = new TermEntry(term.text(), count++, df);
      termEntries.put(entry.term, entry);
    } while (te.next());
    te.close();
{code}

My code is essentially Lucene's demo indexing code (IndexFiles.java and 
FileDocument.java: 
http://google.com/codesearch/p?hl=ensa=Ncd=1ct=rc#uGhWbO8eR20/trunk/src/demo/org/apache/lucene/demo/FileDocument.javaq=org.apache.lucene.demo.IndexFiles
} except that I replaced
{code}doc.add(new Field("contents", new FileReader(f)));{code}

with
{code}doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES));{code}

I then ran {code} java -cp classpath org.apache.lucene.demo.IndexFiles 
/Users/dlwh/txt-reuters/ {code}

and then {code} java -cp classpath org.apache.mahout.utils.vectors.Driver 
--dir /Users/dlwh/src/lucene/index/ --output ~/src/vec-reuters -f contents -t 
/Users/dlwh/dict --weight TF {code}

For what it's worth, it gives a null on "reuters", which is not usually a stop 
word, except that every single document ends with it, and so the IDF filtering 
above is catching it.



 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-126-benson.patch, 
 MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
 MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken in two tasks. 
 1. Create lucene index of the input  plain-text documents 
 2. From the index, generate the document vectors (sparse) with weights as 
 TF-IDF values of the term. With lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: MAHOUT-65

2009-06-18 Thread David Hall
Oh, wow, never mind. Vector implements Writable.

Sorry everyone.

-- David

On Thu, Jun 18, 2009 at 12:19 PM, David Halld...@cs.stanford.edu wrote:
 actually, it looks like someone went to all the trouble to make both
 SparseVector and DenseVector have all the methods required by
 Writable, but they don't implement Writable.

 Could I just make Vector extend Writable?

 -- David

 On Thu, Jun 18, 2009 at 12:01 PM, David Halld...@cs.stanford.edu wrote:
 following up on my earlier email.

 Would anyone be interested in a compressed serialization for
 DenseVector/SparseVector that follows in the vein of
 hadoop.io.Writable? The space overhead for gson (parsing issues
 not-withstanding) is pretty high, and it wouldn't be terribly hard to
 implement a high-performance thing for vectors.

 -- David

 On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastmanj...@windwardsolutions.com 
 wrote:
 +1, you added name constructors that I didn't have and the equals/equivalent
 stuff. Ya, Gson makes it all pretty trivial once you grok it.


 Grant Ingersoll wrote:

 Shall I take that as approval of the approach?

 BTW, the Gson stuff seems like a winner for serialization.

 On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:

 You gonna commit your patch? I agree with shortening the class name in
 the JsonVectorAdapter and will do it once you commit ur stuff.
 Jeff










Re: MAHOUT-65

2009-06-18 Thread David Hall
How often does Mahout need the Comparable part for Vectors? Are
vectors commonly used as map output keys?

In terms of space efficiency, I'd bet it's probably a bit better than
a factor of two in the average case, especially for densevectors. The
gson format is storing both the int index and the double as raw
strings, plus whatever boundary characters.  The writable
implementation stores just the bytes of the double, plus a length.
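
Something along these lines is what I have in mind. A minimal sketch only
(the class name is made up, and Mahout's real vectors obviously carry more
state than a bare double[]):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class DenseVectorWritableSketch implements Writable {

  private double[] values = new double[0];

  public DenseVectorWritableSketch() {}

  public DenseVectorWritableSketch(double[] values) {
    this.values = values;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(values.length);   // cardinality
    for (double v : values) {
      out.writeDouble(v);          // 8 raw bytes per element, no string parsing
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int size = in.readInt();
    values = new double[size];
    for (int i = 0; i < size; i++) {
      values[i] = in.readDouble();
    }
  }
}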

-- David

On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastmanj...@windwardsolutions.com wrote:
 +1 asWritableComparable is a simple implementation that uses asFormatString.
 It would be good to rewrite it for internal communication. A factor of two
 is still a factor of two.

 Jeff


 Grant Ingersoll wrote:

 On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:

 Writable should be plenty!


 +1.  Still nice to have JSON for user facing though.

 On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu wrote:

 See my followup on another thread (sorry for the schizophrenic
 posting); Vector already implements Writable, so that's all I really
 can ask of it. Is there something more you'd like? I'd be happy to do
 it.










Re: MAHOUT-65

2009-06-18 Thread David Hall
On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastmanj...@windwardsolutions.com wrote:
 Shall I change the method to asWritable()?

I'd just be for getting rid of it. Vector implements Writable, so
asWritable() could just be "return this;", which seems gratuitous.

As for actual efficiency:
   
lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java

is currently dumping output values as the text strings. If there's a
standard dataset, that would be an easy place to do the test.

- David

 I don't know of any situations where Vectors are used as keys. It hardly
 makes sense to use them as they are so unwieldy. Suggest we could change to
 just Writable and be ahead. In terms of the potential density improvement,
 it will be interesting to see what can typically be achieved.

 r786323 just removed all calls to asWritableComparable, replacing them with
 asFormatString which was correct anyway.



 Jeff

 David Hall wrote:

 How often does Mahout need the Comparable part for Vectors? Are
 vectors commonly used as map output keys?

 In terms of space efficiency, I'd bet it's probably a bit better than
 a factor of two in the average case, especially for densevectors. The
 gson format is storing both the int index and the double as raw
 strings, plus whatever boundary characters.  The writable
 implementation stores just the bytes of the double, plus a length.

 -- David

 On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastmanj...@windwardsolutions.com
 wrote:


 +1 asWritableComparable is a simple implementation that uses
 asFormatString.
 It would be good to rewrite it for internal communication. A factor of
 two
 is still a factor of two.

 Jeff


 Grant Ingersoll wrote:


 On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:



 Writable should be plenty!



 +1.  Still nice to have JSON for user facing though.



 On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu
 wrote:



 See my followup on another thread (sorry for the schizophrenic
 posting); Vector already implements Writable, so that's all I really
 can ask of it. Is there something more you'd like? I'd be happy to do
 it.















Who runs the github clone of Mahout?

2009-06-17 Thread David Hall
I'd like to request it be updated to svn's head...

Thanks,
David


Re: Who runs the github clone of Mahout?

2009-06-17 Thread David Hall
http://github.com/apache/mahout/

Researching it a little, it seems to be run by some kind of
auto-mirroring. A lot of (all? most?) Apache projects are there, but
none of them have been updated since June 10.

-- David

On Wed, Jun 17, 2009 at 10:53 AM, Grant Ingersollgsing...@apache.org wrote:
 There's a github clone of Mahout?

 On Jun 17, 2009, at 1:20 PM, David Hall wrote:

 I'd like to request it be updated to svn's head...

 Thanks,
 David





[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-17 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720816#action_12720816
 ] 

David Hall commented on MAHOUT-126:
---

LuceneIteratable (is that an intentional pun?) has behavior that isn't 
documented well. Namely, if the normless constructor is called, the norm 
defaults to 2.

This has the consequence that not passing in a norm to Driver L2-normalizes the 
vectors. You have to specify a negative double != -1.0 to get unnormalized 
counts. Relatedly, -1 maps to the L2 norm. This is odd behavior to me, or it 
should at least be documented. (The wiki article implies there's a difference 
between using --norm 2 and using no norm at all.)

Also, I'd like an option to tell Driver what weight object to use. I can do the 
patch for this.

Thanks!

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-126-benson.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken in two tasks. 
 1. Create lucene index of the input  plain-text documents 
 2. From the index, generate the document vectors (sparse) with weights as 
 TF-IDF values of the term. With lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-17 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721068#action_12721068
 ] 

David Hall commented on MAHOUT-126:
---

Ok, I'm probably misunderstanding something, or there could be a bug. I 
modified Lucene's demo indexer to store a term vector. It's still crashing. I 
added a series of printlns before TermVector.java:65 and CachedTermInfo:71, and 
I end up with the assertion here failing:

{code}
  @Override
  public TermEntry getTermEntry(String field, String term) {
    if (this.field.equals(field) == false) { return null; }
    TermEntry ret = termEntries.get(term);
    assert(ret != null); // This assertion is firing.
    return ret;
  }
{code}

In my dataset, this happens after several hundred iterations. The term is a 
stop-word for the corpus in question, and it looks like there's an attempt at 
stopwording earlier in the file. Maybe these are not interacting well?

-- David

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-126-benson.patch, 
 MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
 MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken in two tasks. 
 1. Create lucene index of the input  plain-text documents 
 2. From the index, generate the document vectors (sparse) with weights as 
 TF-IDF values of the term. With lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

2009-06-17 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-126:
--

Attachment: MAHOUT-126-null-entry.patch

I'm going to assume that's the problem. The attached patch just skips over any 
null term vectors. It seems like reasonable behavior here, given the filtering.



 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-126-benson.patch, 
 MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
 MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken in two tasks. 
 1. Create lucene index of the input  plain-text documents 
 2. From the index, generate the document vectors (sparse) with weights as 
 TF-IDF values of the term. With lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-06-16 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-123:
--

Attachment: lda.patch

 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collections of documents and condense them down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topics models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for paralellzing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processors gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P.,  Welling, M. Distributed
 Inference for Latent Dirichlet Allocation. NIPS, 2007.
 Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
 continuous-time model of topical trends. KDD, 2006
 Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
 large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-06-16 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-123:
--

Fix Version/s: 0.2
Affects Version/s: 0.2
   Status: Patch Available  (was: Open)

This is a roughcut implementation. Not ready to go yet. I've been waiting on 
MAHOUT-126 because it seems like the way to create the Vectors I need. Or 
perhaps there's a better way.

Basic approach follows the Dirichlet implementation. There is a driver class 
(LDA Driver) which runs K mapreduces, and a Mapper and a Reducer. We also have 
an Inferencer, which is what the Mapper uses to compute expected sufficient 
statistics. A document is just a V-dimensional sparse vector of word counts.

Map: Perform Inference on each document (~ E-step) and output log probabilities 
of p(word|topic)
Reduce: logSum the input log probabilities (~ M-Step), and output the result.

Loop: use the results of the reduce as the log probabilities for the map.
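
To make the reduce step concrete, here is a sketch (illustrative only, not the
patch itself; the class name is made up) of the log-space summation it relies
on, so that probabilities arriving as logs can be added without underflow:

{code}
public final class LogSumSketch {

  private LogSumSketch() { }

  // Returns log(exp(a) + exp(b)) without ever forming the raw exponentials.
  public static double logSum(double a, double b) {
    if (Double.isInfinite(a) && a < 0) {
      return b; // log(0) contributes nothing
    }
    if (Double.isInfinite(b) && b < 0) {
      return a;
    }
    double max = Math.max(a, b);
    return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
  }

  public static void main(String[] args) {
    // Two mappers each contribute log p = -700; exp-then-add underflows to zero,
    // but in log space the combined value is about -699.307.
    System.out.println(logSum(-700.0, -700.0));
  }
}
{code}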

Remaining:
1) Actually run the thing
2) Number-of-non-zero elements in a sparse vector. Is that staying size?
3) Allow for computing of likelihood to determine when we're done.
4) What's the status of serializing as sparse vector and reading as a dense 
vector? Is that going to happen?
5) Find a fun data set to bundle...
6) Convenience method for running just inference on a set of documents and 
outputting MAP estimates of word probabilities.

 Implement Latent Dirichlet Allocation
 -

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
Reporter: David Hall
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: lda.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 (For GSoC)
 Abstract:
 Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
 algorithm for automatically and jointly clustering words into topics
 and documents into mixtures of topics, and it has been successfully
 applied to model change in scientific fields over time (Griffiths and
 Steyver, 2004; Hall, et al. 2008). In this project, I propose to
 implement a distributed variant of Latent Dirichlet Allocation using
 MapReduce, and, time permitting, to investigate extensions of LDA and
 possibly more efficient algorithms for distributed inference.
 Detailed Description:
 A topic model is, roughly, a hierarchical Bayesian model that
 associates with each document a probability distribution over
 topics, which are in turn distributions over words. For instance, a
 topic in a collection of newswire might include words about sports,
 such as baseball, home run, player, and a document about steroid
 use in baseball might include sports, drugs, and politics. Note
 that the labels sports, drugs, and politics, are post-hoc labels
 assigned by a human, and that the algorithm itself only assigns
 associate words with probabilities. The task of parameter estimation
 in these models is to learn both what these topics are, and which
 documents employ them in what proportions.
 One of the promises of unsupervised learning algorithms like Latent
 Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
 massive collections of documents and condense them down into a
 collection of easily understandable topics. However, all available
 open source implementations of LDA and related topics models are not
 distributed, which hampers their utility. This project seeks to
 correct this shortcoming.
 In the literature, there have been several proposals for paralellzing
 LDA. Newman, et al (2007) proposed to create an approximate LDA in
 which each processors gets its own subset of the documents to run
 Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
 its very nature, which is not advantageous for repeated runs. Instead,
 I propose to follow Nallapati, et al. (2007) and use a variational
 approximation that is fast and non-random.
 References:
 David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
 David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
 allocation, The Journal of Machine Learning Research, 3, p.993-1022,
 3/1/2003
 T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
 Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
 David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
 the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
 Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
 variational EM for Latent Dirichlet Allocation: An experimental
 evaluation of speed and scalability, ICDM workshop on high performance
 data mining, 2007.
 Newman, D., Asuncion, A., Smyth, P.,  Welling, M. Distributed
 Inference for Latent Dirichlet Allocation

[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-05-29 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714362#action_12714362
 ] 

David Hall commented on MAHOUT-126:
---

 Sure, I just want to be able to have:

 double weight = similarity.tf(termFreq) * similarity.idf(docFreq, numDocs);

be this instead:

double weight = termFreq

based on some configuration or another. (Maybe if I can just pass in a custom 
Similarity object? Or there could be a protected method createSimilarity 
that I could override?)

Basically, LDA wants raw counts (or at least, some kind of integers).
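
As one possible shape for the custom Similarity idea, a sketch under the
assumption that the weight stays similarity.tf(termFreq) *
similarity.idf(docFreq, numDocs); the class name is made up:

{code}
import org.apache.lucene.search.DefaultSimilarity;

// tf() returns the raw count and idf() returns 1, so tf * idf collapses to the
// raw term frequency that LDA wants.
public class RawTermFrequencySimilarity extends DefaultSimilarity {

  @Override
  public float tf(float freq) {
    return freq;        // no sqrt damping
  }

  @Override
  public float idf(int docFreq, int numDocs) {
    return 1.0f;        // no document-frequency scaling
  }
}
{code}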

Thanks!


 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Reporter: Shashikant Kore
 Attachments: MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken in two tasks. 
 1. Create lucene index of the input  plain-text documents 
 2. From the index, generate the document vectors (sparse) with weights as 
 TF-IDF values of the term. With lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: digamma function now in commons math

2009-05-25 Thread David Hall
Sounds good. Thanks for taking the time to do that! Should I add a
dependency on math 2.0-SNAPSHOT then? It seems unlikely to cause any
problems, except in my own code.

On Mon, May 25, 2009 at 11:00 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 David,

 The commons math team accepted my patch and have committed it to trunk.  The
 version that they have includes better test cases and a trigramma function.
 This patch will be part of the 2.0 release which is still some time away.

 --
 Ted Dunning, CTO
 DeepDyve



[jira] Created: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-05-23 Thread David Hall (JIRA)
Implement Latent Dirichlet Allocation
-

 Key: MAHOUT-123
 URL: https://issues.apache.org/jira/browse/MAHOUT-123
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Reporter: David Hall


(For GSoC)

Abstract:

Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
algorithm for automatically and jointly clustering words into topics
and documents into mixtures of topics, and it has been successfully
applied to model change in scientific fields over time (Griffiths and
Steyver, 2004; Hall, et al. 2008). In this project, I propose to
implement a distributed variant of Latent Dirichlet Allocation using
MapReduce, and, time permitting, to investigate extensions of LDA and
possibly more efficient algorithms for distributed inference.

Detailed Description:

A topic model is, roughly, a hierarchical Bayesian model that
associates with each document a probability distribution over
topics, which are in turn distributions over words. For instance, a
topic in a collection of newswire might include words about sports,
such as baseball, home run, player, and a document about steroid
use in baseball might include sports, drugs, and politics. Note
that the labels sports, drugs, and politics, are post-hoc labels
assigned by a human, and that the algorithm itself only associates
words with probabilities. The task of parameter estimation
in these models is to learn both what these topics are, and which
documents employ them in what proportions.

One of the promises of unsupervised learning algorithms like Latent
Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take
massive collections of documents and condense them down into a
collection of easily understandable topics. However, all available
open source implementations of LDA and related topic models are not
distributed, which hampers their utility. This project seeks to
correct this shortcoming.

In the literature, there have been several proposals for parallelizing
LDA. Newman, et al (2007) proposed to create an approximate LDA in
which each processor gets its own subset of the documents to run
Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
its very nature, which is not advantageous for repeated runs. Instead,
I propose to follow Nallapati, et al. (2007) and use a variational
approximation that is fast and non-random.


References:

David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.

David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
allocation, The Journal of Machine Learning Research, 3, p.993-1022,
3/1/2003

T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.

David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.

Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
variational EM for Latent Dirichlet Allocation: An experimental
evaluation of speed and scalability, ICDM workshop on high performance
data mining, 2007.

Newman, D., Asuncion, A., Smyth, P., & Welling, M. Distributed
Inference for Latent Dirichlet Allocation. NIPS, 2007.


Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
continuous-time model of topical trends. KDD, 2006


Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Special functions

2009-05-23 Thread David Hall
Hi,

For my project, I need to have an impl of the digamma function:
http://en.wikipedia.org/wiki/Digamma_function

Apache commons math doesn't have it (oddly), so I need to acquire it
from somewhere else.

I trust Radford Neal, who wrote the implementation here:

http://google.com/codesearch/p?hl=en#EbB356_xxkI/fbm.2003-06-29/util/digamma.c

The license seems more than permissive enough...

Alternatively, I can try to track down a book (Numerical Recipes?)
with pseudocode.
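
If it helps, here is a sketch of the usual recurrence-plus-asymptotic-series
approach (my own rough code, not Neal's and not Numerical Recipes; the cutoffs
are illustrative, not tuned):

public final class DigammaSketch {

  private static final double EULER_GAMMA = 0.5772156649015329;

  private DigammaSketch() { }

  public static double digamma(double x) {
    if (x > 0 && x <= 1e-5) {
      // Near zero, psi(x) ~ -gamma - 1/x.
      return -EULER_GAMMA - 1.0 / x;
    }
    if (x >= 49.0) {
      // Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
      double inv = 1.0 / (x * x);
      return Math.log(x) - 0.5 / x
          - inv * (1.0 / 12.0 - inv * (1.0 / 120.0 - inv / 252.0));
    }
    // Recurrence psi(x) = psi(x + 1) - 1/x pushes x into the asymptotic range.
    return digamma(x + 1.0) - 1.0 / x;
  }

  public static void main(String[] args) {
    System.out.println(digamma(1.0)); // should be close to -0.5772, i.e. -gamma
  }
}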

-- David


Re: Special functions

2009-05-23 Thread David Hall
"Share" is too strong. He released a number of functions in the
library I linked to, and the only requirement of the license seems to
be that we maintain the copyright notice and say what we changed:

/* Copyright (c) 1995-2003 by Radford M. Neal
 *
 * Permission is granted for anyone to copy, use, modify, or distribute this
 * program and accompanying programs and documents for any purpose, provided
 * this copyright notice is retained and prominently displayed, along with
 * a note saying that the original programs are available from Radford Neal's
 * web page, and note is made of any changes made to the programs.  The
 * programs and documents are distributed without any warranty, express or
 * implied.  As the programs were written for research purposes only, they have
 * not been tested to the degree that would be advisable in any important
 * application.  All use of these programs is entirely at the user's own risk.
 */

I can also just email him directly.

-- David


On Sat, May 23, 2009 at 2:22 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Avoid Numerical Recipes if you want to avoid license issues.  Their
 publisher has a strong history of being very strict about their
 interpretation of what they think they own.

 If Radford Neal has an implementation that he would share, I would count
 that as a great contribution.

 On Sat, May 23, 2009 at 2:09 PM, David Hall d...@cs.stanford.edu wrote:


 Alternatively, I can try to track down a book (Numerical Recipes?)
 with pseudocode.




 --
 Ted Dunning, CTO
 DeepDyve



Re: Special functions

2009-05-23 Thread David Hall
Relatedly, I need an implementation of logGamma, which is available in
apache commons math. Can I add a dependency?
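
For reference, the call I'd be making is just this (assuming the commons-math
package layout I'm looking at, org.apache.commons.math.special):

import org.apache.commons.math.special.Gamma;

public class LogGammaExample {
  public static void main(String[] args) {
    // log Gamma(5) = log(4!) = log(24), roughly 3.178
    System.out.println(Gamma.logGamma(5.0));
  }
}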

-- David

On Sat, May 23, 2009 at 2:26 PM, David Hall d...@cs.stanford.edu wrote:
 Share is too strong. He released a number of functions in the
 library I linked to, and the only requirement of the license seems to
 be we maintain the copyright notice and say what we changed:

 /* Copyright (c) 1995-2003 by Radford M. Neal
  *
  * Permission is granted for anyone to copy, use, modify, or distribute this
  * program and accompanying programs and documents for any purpose, provided
  * this copyright notice is retained and prominently displayed, along with
  * a note saying that the original programs are available from Radford Neal's
  * web page, and note is made of any changes made to the programs.  The
  * programs and documents are distributed without any warranty, express or
  * implied.  As the programs were written for research purposes only, they 
 have
  * not been tested to the degree that would be advisable in any important
  * application.  All use of these programs is entirely at the user's own risk.
  */

 I can also just email him directly.

 -- David


 On Sat, May 23, 2009 at 2:22 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Avoid Numerical Recipes if you want to avoid license issues.  Their
 publisher has a strong history of being very strict about their
 interpretation of what they think they own.

 If Radford Neal has an implementation that he would share, I would count
 that as a great contribution.

 On Sat, May 23, 2009 at 2:09 PM, David Hall d...@cs.stanford.edu wrote:


 Alternatively, I can try to track down a book (Numerical Recipes?)
 with pseudocode.




 --
 Ted Dunning, CTO
 DeepDyve




Re: Special functions

2009-05-23 Thread David Hall
Thanks!

-- David

On Sat, May 23, 2009 at 4:44 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 David,

 Actually, I just looked around and didn't see much interesting and cleanly
 available along this line so I just wrote a digamma function.

 See https://issues.apache.org/jira/browse/MATH-267 for a tar file containing
 an implementation with test cases.  I went ahead and copyrighted this for
 apache use.  It contains source, comments and test values derived from
 mathematica.

 In the process, I discovered that the R implementation of digamma is really
 crappy for medium small positive values of x.

 On Sat, May 23, 2009 at 2:26 PM, David Hall d...@cs.stanford.edu wrote:

 Share is too strong. He released a number of functions in the
 library I linked to, and the only requirement of the license seems to
 be we maintain the copyright notice and say what we changed:

 /* Copyright (c) 1995-2003 by Radford M. Neal
  *
  * Permission is granted for anyone to copy, use, modify, or distribute
 this
  * program and accompanying programs and documents for any purpose,
 provided
  * this copyright notice is retained and prominently displayed, along with
  * a note saying that the original programs are available from Radford
 Neal's
  * web page, and note is made of any changes made to the programs.  The
  * programs and documents are distributed without any warranty, express or
  * implied.  As the programs were written for research purposes only, they
 have
  * not been tested to the degree that would be advisable in any important
  * application.  All use of these programs is entirely at the user's own
 risk.
  */

 I can also just email him directly.

 -- David


 On Sat, May 23, 2009 at 2:22 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  Avoid Numerical Recipes if you want to avoid license issues.  Their
  publisher has a strong history of being very strict about their
  interpretation of what they think they own.
 
  If Radford Neal has an implementation that he would share, I would count
  that as a great contribution.
 
  On Sat, May 23, 2009 at 2:09 PM, David Hall d...@cs.stanford.edu
 wrote:
 
 
  Alternatively, I can try to track down a book (Numerical Recipes?)
  with pseudocode.
 
 
 
 
  --
  Ted Dunning, CTO
  DeepDyve
 




 --
 Ted Dunning, CTO
 DeepDyve

 111 West Evelyn Ave. Ste. 202
 Sunnyvale, CA 94086
 http://www.deepdyve.com
 858-414-0013 (m)
 408-773-0220 (fax)



Re: neural network

2009-05-07 Thread David Hall
Map-Reduce for Machine Learning on Multicore
by: Cheng T Chu, Sang K Kim, Yi A Lin, Yuanyuan Yu, Gary R Bradski,
Andrew Y Ng, Kunle Olukotun
edited by: Bernhard Schölkopf, John C Platt, Thomas Hoffman

http://www.citeulike.org/user/zzztimbo/article/2308503


On Thu, May 7, 2009 at 1:42 PM, Danny-Michael Busch da...@kurbel.net wrote:
 Ted Dunning schrieb:

 I don't think that anybody has done any serious work on this yet.


 I was starting here: http://cwiki.apache.org/MAHOUT/neural-network.html
 However, I could not find the mentioned paper on the nips.cc website -
 could anyone give me a hint where this paper could be found?

 Thanks,
 Danny

 --
 KURBEL
 Softwareentwicklung
  IT - Beratung
 Danny-Michael Busch
 Wilhelmstraße 2
 D-35392 Gießen
 Tel.: (01520) 849 8469
 http://www.kurbel.net





Re: [GSOC] Accepted Students

2009-04-23 Thread David Hall
Thanks everyone!

-- David

On Thu, Apr 23, 2009 at 12:53 PM, Grant Ingersoll gsing...@apache.org wrote:
 It's also helpful to get yourself a Wiki account and a JIRA account if you
 don't already have them.  Small patches to the existing docs/code can also
 help you figure out the process


 On Apr 21, 2009, at 1:19 PM, Isabel Drost wrote:

 On Tuesday 21 April 2009 08:30:34 David Hall wrote:

 As for questions, what am I supposed to be reading during this
 community building period? I see:

 * http://cwiki.apache.org/MAHOUT/howtocontribute.html
 * http://www.apache.org/foundation/how-it-works.html

 plus skimming javadocs.

 These are certainly of interest.

 In addition you can checkout and have a look at the code. Try to get a
 rough
 idea of where your contribution would fit best. Please share your ideas
 with
 the community to get feedback early on.




Re: Introduction for student interested in GSoC

2009-03-31 Thread David Hall
Here's a follow-up proposal (submitted to the GSoC site; I will add it to
the wiki, but I'm having trouble accessing it right now).

Thanks!

-- David

Title/Summary: Distributed Latent Dirichlet Allocation

Student: David Hall

Student e-mail: d...@cs.stanford.edu


Student Major: Symbolic Systems/ Computer Science

Student Degree: MS/PhD

Student Graduation:  Stanford '09 / Berkeley '14


Organization: Hadoop

Assigned Mentor: Grant Ingersoll


Abstract:

Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
algorithm for automatically and jointly clustering words into topics
and documents into mixtures of topics, and it has been successfully
applied to model change in scientific fields over time (Griffiths and
Steyver, 2004; Hall, et al. 2008). In this project, I propose to
implement a distributed variant of Latent Dirichlet Allocation using
MapReduce, and, time permitting, to investigate extensions of LDA and
possibly more efficient algorithms for distributed inference.

Detailed Description:

A topic model is, roughly, a hierarchical Bayesian model that
associates with each document a probability distribution over
topics, which are in turn distributions over words. For instance, a
topic in a collection of newswire might include words about sports,
such as baseball, home run, player, and a document about steroid
use in baseball might include sports, drugs, and politics. Note
that the labels sports, drugs, and politics, are post-hoc labels
assigned by a human, and that the algorithm itself only associates
words with probabilities. The task of parameter estimation
in these models is to learn both what these topics are, and which
documents employ them in what proportions.

One of the promises of unsupervised learning algorithms like Latent
Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take
massive collections of documents and condense them down into a
collection of easily understandable topics. However, all available
open source implementations of LDA and related topic models are not
distributed, which hampers their utility. This project seeks to
correct this shortcoming.

In the literature, there have been several proposals for parallelizing
LDA. Newman, et al (2007) proposed to create an approximate LDA in
which each processor gets its own subset of the documents to run
Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
its very nature, which is not advantageous for repeated runs. Instead,
I propose to follow Nallapati, et al. (2007) and use a variational
approximation that is fast and non-random.

From there, I would like to extend LDA to either Supervised Topic
Models (Blei & McAuliffe, 2007) or Topics over Time (Wang & McCallum,
2006). The former would enable LDA to be used for basic classification
tasks, and the latter would model topic dynamics explicitly. For
instance, a "politics" topic is not static: certain words are more
important at some times than others, and the prominence of a topic
like "sports" rises and falls around playoff schedules and the like. A
basic implementation of either of these would be a reasonably
straightforward extension to LDA, and they would also prove the
flexibility of my implementation.
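
(As I read Wang & McCallum (2006), Topics over Time keeps the LDA
structure but additionally draws each token's timestamp from a
per-topic Beta distribution, roughly t_{d,n} ~ Beta(\psi_{z_{d,n}}),
so a topic's parameters directly encode when it is prominent; the
exact parameterization is something I would settle during
implementation.)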

Finally, time permitting, I would like to examine the more efficient
algorithms using junction trees proposed by Wolfe et al. (2008). They
demonstrate a substantial speed-up over the naive implementation
proposed earlier, but the framework does not fit as easily into the
standard map-reduce architecture as implemented by Hadoop. I
anticipate that this work will be more exploratory in nature, with a
focus on laying the groundwork for improvements more than on a
polished implementation.

Biography:

I am a graduating master's student in Symbolic Systems at Stanford
University, and I will begin work on a PhD at UC Berkeley next
autumn. I am currently a member of Stanford's Natural Language
Processing group under Dan Jurafsky and Chris Manning. (I will be
working with Dan Klein in the fall.) My research has involved the
application of topic models to modeling and discovering the origins of
scientific "paradigms," breakthroughs that change a scientific field
or even create a new field. More recently, I've worked on minimally
unsupervised part-of-speech tagging.


In terms of my experience with Hadoop and Map Reduce, I have interned
at Google for two summers, working with MapReduce for the entirety of
both internships. My second summer I worked on Google's Machine
Translation team, and so I am familiar with using Map Reduce to
implement large-scale NLP and Machine Learning algorithms.

More recently, I've been in charge of setting up and maintaining a
small Hadoop cluster in my research group, and I've written a wrapper
library for Hadoop in the Scala Programming Language. The code isn't
quite release quality yet, but you can see its work-in-progress
state at http://bugs.scalanlp.org/repositories/show/smr , and I've
written a short blog post about it at
http://scala-blogs.org

Re: Introduction for student interested in GSoC

2009-03-24 Thread David Hall
On Tue, Mar 24, 2009 at 4:15 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 This sounds fantastic.

 I think that your Scala code is interesting, but your thoughts on LDA
 are much more so.  I tried doing a similar simplification of map-reduce
 program writing using Groovy and found that, in spite of even smaller
 programs than you quote for word-count, the benefits in practice were
 relatively small.  Using Pig was much more productive, even with the
 lack of any real programming language.

Thanks! I agree that SMR isn't there yet, and it really isn't a Mahout
thing. I could get closer to the Groovy line count, but my main goal
was to remove all the boilerplate associated with Hadoop (Text,
IntWritable, Mapper/Reducer) and to get closer to the real program
logic. You are right that Pig is usually more useful for many tasks,
and one of my plans is to duplicate some of its functionality, though
I actually think I prefer Dryad/LINQ's kind of syntax.
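
For reference, the kind of boilerplate I mean is the stock Hadoop
word-count mapper against the org.apache.hadoop.mapred API (this is
just the textbook example, not SMR code); everything outside the inner
loop is type plumbing rather than program logic:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE);   // the actual logic: emit (word, 1)
    }
  }
}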


 It would also be interesting to see how you might attack semi-supervised
 multi-task learning using a well-founded Bayesian approach.  For a
 non-Bayesian example with impressive results, see Ronan Collobert's paper:
 http://ronan.collobert.com/pub/2008_nlp_icml.html

Interesting. I'll take a closer look at this this evening.

-- David


 On Tue, Mar 24, 2009 at 12:26 AM, David Hall d...@cs.stanford.edu wrote:

 This summer, I'd like to help contribute to the Mahout project. I read
 Tijs Zwinkels' proposal, and I think that what I would like to work on
 is sufficiently different from what he would like to do. First, I
 would like to implement Latent Dirichlet Allocation, a popular topic
 mixture model that learns both document clusters and word clusters. I
 would then like to extend it to implement a number of general purpose
 topic models, including Topics over Time, Pachinko Allocation, and
 possibly Supervised Topic Models.




 --
 Ted Dunning, CTO
 DeepDyve

 111 West Evelyn Ave. Ste. 202
 Sunnyvale, CA 94086
 www.deepdyve.com
 408-773-0110 ext. 738
 858-414-0013 (m)
 408-773-0220 (fax)



Re: Introduction for student interested in GSoC

2009-03-24 Thread David Hall
On Tue, Mar 24, 2009 at 4:34 PM, David Hall d...@cs.stanford.edu wrote:
 On Tue, Mar 24, 2009 at 4:15 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 It would also be interesting to see how you might attack semi-supervised
 multi-task learning using a well-founded Bayesian approach.  For a
 non-Bayesian example with impressive results, see Ronan Collobert's paper:
 http://ronan.collobert.com/pub/2008_nlp_icml.html

 Interesting. I'll take a closer look at this this evening.

Actually, my officemate's dissertation project is very closely related
to this, except using parsing as a base. That is to say, I probably
shouldn't work on it, because I'd be stepping on her toes...

-- David