Re: forrest 1, benson 0
I had this problem too, though I guess at the time I was running normal Leopard, and not Snow Leopard. This link might help? http://chxor.chxo.com/post/183013153/installing-java-1-5-on-snow-leopard

-- David

On Wed, Jan 13, 2010 at 3:08 PM, Benson Margulies bimargul...@gmail.com wrote:
> forrest does not support Java 1.6. MacOSX does not support Java 1.5.
[jira] Commented: (MAHOUT-227) Parallel SVM
[ https://issues.apache.org/jira/browse/MAHOUT-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793053#action_12793053 ]

David Hall commented on MAHOUT-227:
-----------------------------------

As Ted hints, a proposal should really be placed on the wiki: http://cwiki.apache.org/MAHOUT/ Looking forward to it.

> Parallel SVM
> ------------
>
> Key: MAHOUT-227
> URL: https://issues.apache.org/jira/browse/MAHOUT-227
> Project: Mahout
> Issue Type: Task
> Components: Classification
> Reporter: zhao zhendong
> Attachments: svmProposal.patch
>
> I wrote a proposal for a parallel SVM training algorithm. Any comment is welcome.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Re: SVM algo, code, etc.
On Fri, Dec 11, 2009 at 5:02 AM, Jake Mannix jake.man...@gmail.com wrote:
> I really feel like I should respond to this, but seeing as I live on the west
> coast of the US, going to bed might be more advisable. On a very specific
> topic of SVMs, I can certainly look into this, but David, were you interested
> in helping bring this into Mahout and help maintain it? You are often rather
> quiet on here, yet happened to jump in as this topic came up?

Yeah, the first semester of the PhD program has been far more busy than I imagined, and I've been overwhelmed. (Now it's finals week.) Online optimization has kind of caught my eye of late, with Pegasos being something I had been thinking about implementing. I would be glad to get this up and running, though I'd like more to help curate a patch, Pegasos or no.

-- David

> -jake
>
> On Fri, Dec 11, 2009 at 4:40 AM, Sean Owen sro...@gmail.com wrote:
>> This is a timely message, since I'm currently presuming to close some old
>> Mahout issues at the moment, and it raises a related concern. There are
>> lots of old JIRA issues of the form:
>>
>> 1) somebody submits a patch implementing part of something
>> 2) some comments happen, maybe
>> 3) nothing happens for a year
>> 4) I close it now
>>
>> At an early stage, this is fine, actually. 20 people contribute at the
>> start; 3 select themselves naturally as regular contributors. 20 patches go
>> up; the 5 that are of use and interest naturally get picked up and
>> eventually committed. But going forward, this probably won't do. Potential
>> committers get discouraged and work goes wasted. (See comments about
>> Commons Math on this list for an example of the fallout.)
>>
>> I wonder what the obstacles are to avoiding this?
>>
>> 1) Do we need to be clearer about what the project is and isn't about?
>> What the priorities are, what work is already on the table to be done?
>> This is why I am keen on cleaning up JIRA now; it's hard for even us to
>> understand what's in progress and what's important.
>>
>> 2) Do we need some more official ownership of or responsibility for
>> components? For example, I am not sure who would manage changes to the
>> clustering stuff. I know it isn't me; I don't know about that part. So
>> what happens to an incoming patch to clustering? While too much
>> command-and-control isn't possible or desirable in open source, lack of it
>> is harmful too. I don't think the answer is just to let people commit bits
>> and bobs, since it makes the project appear to be a workbench of
>> half-finished jobs, which does a disservice to the components that are
>> polished.
>>
>> I have no reason to believe this SVM patch, should it materialize, would
>> fall through the cracks in this way, but I want to ask now how we can make
>> sure. So, can we answer:
>>
>> 1) Is SVM in scope for Mahout? (I am guessing so.)
>> 2) Who is nominally committing to shepherd the code into the code base and
>> fix bugs and answer questions? (Jake?)
>>
>> I'm not really bothered about this particular patch, but the more general
>> question.
[jira] Assigned: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
[ https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall reassigned MAHOUT-197:
---------------------------------

Assignee: David Hall

> LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
> -----------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-197
> URL: https://issues.apache.org/jira/browse/MAHOUT-197
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.2
> Environment: ubuntu 9.04, sun jdk 1.6.0_07, hadoop cluster running 0.20.1, build from r834311 of http://svn.apache.org/repos/asf/lucene/mahout/trunk
> Reporter: Drew Farris
> Assignee: David Hall
> Priority: Minor
> Attachments: LDADriver-setJar.patch
>
> hadoop jar core/target/mahout-core-0.2-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i mahout/foo/foo-vectors -o mahout/foo/lda-cluster -w -k 1000 -v 82342 --maxIter 2
> [...]
> 09/11/09 22:02:00 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> [...]
> 09/11/09 22:02:00 INFO input.FileInputFormat: Total input paths to process : 1
> 09/11/09 22:02:01 INFO mapred.JobClient: Running job: job_200911091316_0005
> 09/11/09 22:02:02 INFO mapred.JobClient: map 0% reduce 0%
> 09/11/09 22:02:12 INFO mapred.JobClient: Task Id : attempt_200911091316_0005_m_00_0, Status : FAILED
> java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
>   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:808)
>   at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:532)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.lang.ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>
> Can be fixed by adding the following line to LDADriver after line 299 in r831743:
>   job.setJarByClass(LDADriver.class);
> (will attach trivial patch)

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
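For context, here is a minimal sketch of where that call sits in a new-API (org.apache.hadoop.mapreduce) driver. Everything except setJarByClass is generic boilerplate, and the class below is illustrative, not the actual LDADriver code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.mahout.clustering.lda.LDADriver;
    import org.apache.mahout.clustering.lda.LDAMapper;

    public class LDAJobSketch {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "lda-iteration");
        // The fix: tell Hadoop which jar to ship to the task nodes. Without
        // this, the task JVMs cannot load LDAMapper, which is exactly the
        // ClassNotFoundException in the report above.
        job.setJarByClass(LDADriver.class);
        job.setMapperClass(LDAMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
      }
    }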
[jira] Updated: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
[ https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-197:
------------------------------

Resolution: Fixed
Status: Resolved (was: Patch Available)

Fixed in 887843. Thanks for the patch!

> LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
> -----------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-197
> URL: https://issues.apache.org/jira/browse/MAHOUT-197
> [...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Re: SVM algo, code, etc.
On Wed, Nov 25, 2009 at 2:35 AM, Isabel Drost isa...@apache.org wrote:
> On Fri Grant Ingersoll gsing...@apache.org wrote:
>> On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
>>> Post a patch if you'd like to proceed, IMHO.
>> +1
> +1 from me as well. I would love to see solid SVM support in Mahout.

And another +1 from me. If you want a pointer, I've recently stumbled on a new solver for SVMs that seems to be remarkably easy to implement. It's called Pegasos: ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf

-- David
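To give a sense of why Pegasos looks so easy to implement: the whole solver is one stochastic update rule. Below is a hedged sketch of the primal update for a linear SVM as I read the paper; the dense arrays and helper names are illustrative, not Mahout API:

    import java.util.Random;

    public class PegasosSketch {
      static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
          s += a[i] * b[i];
        }
        return s;
      }

      // Runs T stochastic updates over examples x with labels y in {-1, +1}.
      static double[] train(double[][] x, double[] y, double lambda, int T) {
        double[] w = new double[x[0].length];
        Random rand = new Random();
        for (int t = 1; t <= T; t++) {
          int i = rand.nextInt(x.length);                // sample one example
          double eta = 1.0 / (lambda * t);               // step size 1/(lambda*t)
          boolean violated = y[i] * dot(w, x[i]) < 1.0;  // hinge margin check
          for (int j = 0; j < w.length; j++) {
            w[j] *= 1.0 - eta * lambda;                  // regularization shrink
            if (violated) {
              w[j] += eta * y[i] * x[i][j];              // subgradient step
            }
          }
          // The paper also projects w onto the ball of radius 1/sqrt(lambda);
          // omitted here for brevity.
        }
        return w;
      }
    }

As I read the paper, the number of updates needed for a given accuracy depends on lambda and the accuracy target, not on the number of training examples.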
Re: SVM algo, code, etc.
On Thu, Dec 3, 2009 at 2:12 AM, Olivier Grisel olivier.gri...@ensta.org wrote:
> 2009/12/3 Ted Dunning ted.dunn...@gmail.com:
>> Very interesting results, particularly the lack of dependence on data size.
>>
>> On Thu, Dec 3, 2009 at 12:02 AM, David Hall d...@cs.berkeley.edu wrote:
>> [...]
>>> It's called Pegasos: ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf
>
> Pegasos and other online implementations of SVMs based on regularized
> variants of stochastic gradient descent are indeed amenable to large-scale
> problems. They solve the SVM optimization problem with a stochastic
> approximation of the primal (as opposed to more 'classical' solvers such as
> libsvm that solve the dual problem using Sequential Minimal Optimization).
> However, SGD-based SVM implementations are currently limited to the linear
> 'kernel' (which is often expressive enough for common NLP tasks such as
> document categorization).

You seem to know far more about this than I do, but the paper I linked to specifically says (page 6) that they can use Mercer kernels.

-- David
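For reference, my reading of how the paper handles kernels, stated as a sketch rather than a definitive account: instead of an explicit weight vector, keep a count \alpha_t[j] of how often example j has been a margin violator, so the decision function is

    f_t(x) = \frac{1}{\lambda t} \sum_j \alpha_t[j] \, y_j \, K(x_j, x)

and at step t, after sampling example i_t, increment \alpha[i_t] by one exactly when y_{i_t} f_t(x_{i_t}) < 1. The update stays cheap, but evaluating f_t touches every past violator, which is where the linear kernel's speed advantage comes from.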
Re: LDA for multi label classification was: Mahout Book
On Fri, Oct 16, 2009 at 4:08 AM, zhao zhendong zhaozhend...@gmail.com wrote:
> I have seen the implementation of L-LDA using Java, Stanford Topic Modeling
> Toolbox http://nlp.stanford.edu/software/tmt/ Does anyone know whether they
> provide the source code or not?

I'm pretty sure it's Scala, no? It's definitely open source. Like I said, however, this implementation is almost certainly Gibbs sampling based, which has consequences for parallelization (or rather, the Rao-Blackwellization does).

-- David

> Thanks, Maxim
>
> On Fri, Oct 16, 2009 at 12:39 PM, David Hall d...@cs.berkeley.edu wrote:
> [...]
>
> --
> Zhen-Dong Zhao (Maxim)
> Department of Computer Science
> School of Computing
> National University of Singapore
> Homepage: http://zhaozhendong.googlepages.com
> Mail: zhaozhend...@gmail.com
Re: LDA for multi label classification was: Mahout Book
Sorry, this slipped out of my inbox and I just found it!

On Thu, Oct 8, 2009 at 12:05 PM, Robin Anil robin.a...@gmail.com wrote:
> Posting to the dev list. Great paper, thanks! Looks like L-LDA could be
> used to create some interesting examples.

Thanks!

> The paper shows L-LDA could be used to create a word-tag model for accurate
> tag(s) prediction given a document of words. I will complete reading and
> tell how much work is needed to transform/build on top of the current LDA
> implementation to get L-LDA. Any thoughts?

Umm, cool! In the paper we used Gibbs sampling to do the inference, and the implementation in Mahout uses variational inference (because it distributes better). I don't see any obvious problems in terms of math, and so the rest is just fitting it into the system. I think a small amount of refactoring would be in order to make things more generic, and then it shouldn't be too hard to plug in. I'll add it to my list, but I'm swamped for quite some time.

-- David

> Robin
>
> On Thu, Oct 8, 2009 at 11:50 PM, David Hall d...@cs.berkeley.edu wrote:
>> The short answer is that it probably won't help all that much. Naive Bayes
>> is unreasonably good when you have enough data.
>>
>> The long answer is, I have a paper with Dan Ramage and Ramesh Nallapati
>> that talks about how to do it:
>> www.aclweb.org/anthology-new/D/D09/D09-1026.pdf
>>
>> In some sense, Labeled-LDA is a kind of Naive Bayes where you can have
>> more than one class per document. If you have exactly one class per
>> document, then LDA reduces to Naive Bayes (or the unsupervised variant of
>> Naive Bayes, which is basically k-means in multinomial space). If instead
>> you wanted to project W words to K topics, with K << numWords, then there
>> is something to do...
>>
>> That something is something like:
>> 1) Get p(topic|word,document) for each word in each document (this is
>>    output by LDAInference). Those are your expected counts for each topic.
>> 2) For each class, do something like:
>>    p(topic|class) \propto \sum_{documents with that class, words} p(topic|word,document)
>> Then just apply Bayes' rule to do classification:
>>    p(class|topics,document) \propto p(class) \prod p(topic|class,document)
>>
>> -- David
>>
>> On Thu, Oct 8, 2009 at 11:07 AM, Robin Anil robin.a...@gmail.com wrote:
>>> Thanks. Didn't see that; fixed it! I have a query: how is the LDA topic
>>> model used to improve a classifier, say Naive Bayes? If it's possible,
>>> then I would like to integrate it into Mahout. Given m classes and the
>>> associated documents, one can build m topic models, right? (A set of
>>> topics (words) under each label and the associated probability
>>> distribution of words.) How can I use that info to weight the most
>>> relevant topic of a class?
>>
>> LDA has two meanings: linear discriminant analysis and latent Dirichlet
>> allocation. My code is the latter. The former is a kind of classification.
>> You say linear discriminant analysis in the outline.
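Spelling that recipe out as display math (my rendering of the steps above, with k for topic, c for class, and d for document):

    c(k, d) = \sum_{w \in d} p(k \mid w, d)   (expected topic counts, from LDAInference)

    p(k \mid c) \propto \sum_{d :\, \mathrm{label}(d) = c} c(k, d)

    p(c \mid \mathrm{topics}, d) \propto p(c) \prod_k p(k \mid c, d)   (Bayes' rule)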
Re: 0.2
2009/9/28 Grant Ingersoll gsing...@apache.org:
> On Sep 28, 2009, at 2:16 PM, Ted Dunning wrote:
>> Many of these are actually nearly (completely) done. Is there a goal for
>> the 0.2 release other than fixing outstanding issues?
>
> I'd like to see some of the performance issues around SparseVector taken
> care of. I think we also said we wanted to get the Random Forest and Bayes
> stuff in that Robin and Deneche are working on. Beyond that, I plan on
> doing some profiling of the LDA stuff. I'd say we are pretty close.

From my memory, using hprof, it looks like most of the time is spent doing math. (I haven't had a chance to try out YourKit, though.)

-- David

>> On Mon, Sep 28, 2009 at 10:53 AM, Grant Ingersoll gsing...@apache.org wrote:
>>> Not too many open at this point:
>>> https://issues.apache.org/jira/secure/BrowseVersion.jspa?id=12310751&versionId=12313278&showOpenIssuesOnly=true
>>> Some are relatively minor, others are ready, but just need a final
>>> review. Can we push towards mid-October for a release? Anyone volunteer
>>> to be the release mgr?
>>> -Grant
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: 0.2 planning
Any reason we haven't closed LDA yet? https://issues.apache.org/jira/browse/MAHOUT-123

-- David

On Sat, Sep 12, 2009 at 11:45 AM, Grant Ingersoll gsing...@apache.org wrote:
> Here's the list of unresolved issues for 0.2:
> https://issues.apache.org/jira/secure/BrowseVersion.jspa?id=12310751&versionId=12313278&showOpenIssuesOnly=true
> Can we start to work towards whittling these down and getting 0.2 out?
> Maybe by mid-October?
> -Grant
[jira] Commented: (MAHOUT-172) When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
[ https://issues.apache.org/jira/browse/MAHOUT-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754588#action_12754588 ]

David Hall commented on MAHOUT-172:
-----------------------------------

Sorry, just noticed this issue! Looks good to me.

-- David

> When running on a Hadoop cluster LDA fails with Caused by: java.io.IOException: Cannot open filename /user/*/output/state-*/_logs
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-172
> URL: https://issues.apache.org/jira/browse/MAHOUT-172
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.1
> Reporter: Isabel Drost
> Fix For: 0.2
> Attachments: lda.patch
>
> I tried running the reuters example of LDA on a Hadoop cluster today. It seems like the implementation tries to read all files in output/state-*, which fails if the _logs directory is found there.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
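The usual shape of such a fix (sketched here as an assumption about what lda.patch does, not a description of it) is to filter directory listings through a PathFilter so Hadoop's bookkeeping entries are skipped:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    public class StateDirListing {
      static FileStatus[] listStateParts(FileSystem fs, Path stateDir)
          throws IOException {
        return fs.listStatus(stateDir, new PathFilter() {
          @Override
          public boolean accept(Path path) {
            // skip _logs and other underscore-prefixed bookkeeping entries
            return !path.getName().startsWith("_");
          }
        });
      }
    }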
[jira] Created: (MAHOUT-161) Add Vector.norm to compute k-norms of vectors
Add Vector.norm to compute k-norms of vectors
---------------------------------------------

Key: MAHOUT-161
URL: https://issues.apache.org/jira/browse/MAHOUT-161
Project: Mahout
Issue Type: Improvement
Components: Matrix
Affects Versions: 0.2
Reporter: David Hall
Fix For: 0.2
Attachments: MAHOUT-161

This patch adds Vector.norm(double power) to Vector (and an implementation to AbstractVector). AbstractVector.normalize now calls norm.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-161) Add Vector.norm to compute k-norms of vectors
[ https://issues.apache.org/jira/browse/MAHOUT-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-161:
------------------------------

Attachment: MAHOUT-161

> Add Vector.norm to compute k-norms of vectors
> ---------------------------------------------
>
> Key: MAHOUT-161
> URL: https://issues.apache.org/jira/browse/MAHOUT-161
> [...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-161) Add Vector.norm to compute k-norms of vectors
[ https://issues.apache.org/jira/browse/MAHOUT-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740821#action_12740821 ]

David Hall commented on MAHOUT-161:
-----------------------------------

normalize(power) returns this.divide(norm(power)). That is, norm(p) == ||v||_p and normalize(p) == v / ||v||_p.

norm(p) is useful for checking whether a vector is sufficiently close to 0, e.g. whether the difference between two phases of an optimization is close enough to declare convergence.

> Add Vector.norm to compute k-norms of vectors
> ---------------------------------------------
>
> Key: MAHOUT-161
> URL: https://issues.apache.org/jira/browse/MAHOUT-161
> [...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
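A minimal sketch of the p-norm being described, written over a plain double[] rather than the Vector interface (the real patch lives in AbstractVector, so names here are illustrative only):

    public class NormSketch {
      // ||v||_p = (sum_i |v_i|^p)^(1/p); power == infinity gives the max norm.
      static double norm(double[] values, double power) {
        if (power == Double.POSITIVE_INFINITY) {
          double max = 0.0;
          for (double v : values) {
            max = Math.max(max, Math.abs(v));
          }
          return max;
        }
        double sum = 0.0;
        for (double v : values) {
          sum += Math.pow(Math.abs(v), power);
        }
        return Math.pow(sum, 1.0 / power);
      }
    }

normalize(power) is then just element-wise division by this value. Powers below 1 aren't true norms (no triangle inequality), and p == 0 would need special-casing; the sketch ignores both.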
[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-123:
------------------------------

Attachment: MAHOUT-123.patch

The problem was that the edited patch wrote topics to .../examples/ and not ../examples, which took a frustratingly long time to figure out. I also played with the parameters a little longer. The topics aren't as great as I'd like, but that's because I haven't figured out the right setting for getting rid of stop words: "could" and "said" are still in there. That said, they're mostly coherent topics, if kind of boring.

-- David

> Implement Latent Dirichlet Allocation
> -------------------------------------
>
> Key: MAHOUT-123
> URL: https://issues.apache.org/jira/browse/MAHOUT-123
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.2
> Reporter: David Hall
> Assignee: Grant Ingersoll
> Fix For: 0.2
> Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> (For GSoC)
>
> Abstract: Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning algorithm for automatically and jointly clustering words into topics and documents into mixtures of topics, and it has been successfully applied to model change in scientific fields over time (Griffiths and Steyvers, 2004; Hall et al., 2008). In this project, I propose to implement a distributed variant of Latent Dirichlet Allocation using MapReduce, and, time permitting, to investigate extensions of LDA and possibly more efficient algorithms for distributed inference.
>
> Detailed Description: A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over topics, which are in turn distributions over words. For instance, a topic in a collection of newswire might include words about sports, such as "baseball", "home run", "player", and a document about steroid use in baseball might include "sports", "drugs", and "politics". Note that the labels "sports", "drugs", and "politics" are post-hoc labels assigned by a human, and that the algorithm itself only associates words with probabilities. The task of parameter estimation in these models is to learn both what these topics are, and which documents employ them in what proportions.
>
> One of the promises of unsupervised learning algorithms like Latent Dirichlet Allocation (LDA; Blei et al., 2003) is the ability to take massive collections of documents and condense them down into a collection of easily understandable topics. However, all available open source implementations of LDA and related topic models are not distributed, which hampers their utility. This project seeks to correct this shortcoming.
>
> In the literature, there have been several proposals for parallelizing LDA. Newman et al. (2007) proposed to create an approximate LDA in which each processor gets its own subset of the documents to run Gibbs sampling over. However, Gibbs sampling is slow and stochastic by its very nature, which is not advantageous for repeated runs. Instead, I propose to follow Nallapati et al. (2007) and use a variational approximation that is fast and non-random.
>
> References:
> David M. Blei, J. McAuliffe. Supervised Topic Models. NIPS, 2007.
> David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3, pp. 993-1022, 2003.
> T. L. Griffiths and M. Steyvers. Finding Scientific Topics. Proc Natl Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
> David L. W. Hall, Daniel Jurafsky, and Christopher D. Manning. Studying the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
> Ramesh Nallapati, William Cohen, John Lafferty. Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability. ICDM Workshop on High Performance Data Mining, 2007.
> Newman, D., Asuncion, A., Smyth, P., Welling, M. Distributed Inference for Latent Dirichlet Allocation. NIPS, 2007.
> Xuerui Wang, Andrew McCallum. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. KDD, 2006.
> Wolfe, J., Haghighi, A., and Klein, D. Fully Distributed EM for Very Large Datasets. ICML, 2008.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
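For readers new to the model, the generative story the abstract alludes to is the standard LDA one (notation mine, not from the issue): for each document d, draw a topic mixture, then draw each word's topic and then the word itself:

    \theta_d \sim \mathrm{Dirichlet}(\alpha)
    z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)
    w_{d,n} \mid z_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})

Here \phi_k is topic k's distribution over words; inference recovers \theta and \phi from the observed words alone.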
[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-123:
------------------------------

Attachment: MAHOUT-123.patch

Patch fixed for Yanen's problem. Apparently I messed up the dependencies somehow so that they'd work on my machine, but not anywhere else. Now I think it's OK. (I nuked my Maven repo.)

-- David

> Implement Latent Dirichlet Allocation
> -------------------------------------
>
> Key: MAHOUT-123
> URL: https://issues.apache.org/jira/browse/MAHOUT-123
> [...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-123:
------------------------------

Attachment: MAHOUT-123.patch

OK, core/bin/build-reuters will:
1) download Reuters to work/reuters(something or another) and untar it,
2) build an index from it using Lucene,
3) convert that index into vectors,
4) run LDA for 40 iterations (which is close enough to convergence) into work/lda, and
5) dump the top 100 words for each topic into work/topics/topic-K, where K is the topic of interest.

> Implement Latent Dirichlet Allocation
> -------------------------------------
>
> Key: MAHOUT-123
> URL: https://issues.apache.org/jira/browse/MAHOUT-123
> [...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738165#action_12738165 ]

David Hall commented on MAHOUT-123:
-----------------------------------

I unfortunately haven't run it on a Hadoop cluster yet. It should just work if you run it with the right Hadoop configuration. Shouldn't running it through the hadoop shell script add the configuration? I'll get it running on a Hadoop cluster soon.

The code actually requires Hadoop 0.20, because Mahout has decided to move in that direction.

-- David

> Implement Latent Dirichlet Allocation
> -------------------------------------
>
> Key: MAHOUT-123
> URL: https://issues.apache.org/jira/browse/MAHOUT-123
> [...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736238#action_12736238 ]

David Hall commented on MAHOUT-123:
-----------------------------------

So it looks like the way Lucene does it is with an Ant task. I can't figure out the Maven way to do this without building some kind of jar from it. I'm happy to do it, but I'm not sure what the proper way is. Thoughts?

-- David

> Implement Latent Dirichlet Allocation
> -------------------------------------
>
> Key: MAHOUT-123
> URL: https://issues.apache.org/jira/browse/MAHOUT-123
> [...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735705#action_12735705 ]

David Hall commented on MAHOUT-123:
-----------------------------------

OK, I'll add more comments. I had been using Reuters-21578, but I'm not convinced that it's OK to include it, and I was looking around for something better. I'll get the download automated for Wikipedia chunks. Is a shell script OK to do most of it?

-- David

> Implement Latent Dirichlet Allocation
> -------------------------------------
>
> Key: MAHOUT-123
> URL: https://issues.apache.org/jira/browse/MAHOUT-123
> [...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Hadoop 0.20 and DummyOutputCollector
The Hadoop gods have seen fit to deprecate OutputCollector and replace it with a non-static inner class of Mapper called Context. This complicates several tests, namely:

./src/test/java/org/apache/mahout/classifier/bayes/BayesFeatureMapperTest.java
./src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java
./src/test/java/org/apache/mahout/clustering/dirichlet/TestMapReduce.java
./src/test/java/org/apache/mahout/clustering/fuzzykmeans/TestFuzzyKmeansClustering.java
./src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java
./src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java
./src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java
./src/test/java/org/apache/mahout/ga/watchmaker/EvalMapperTest.java

and my new LDA test. These won't work in the new API. As far as I can tell, the only ways to fix this are to either:

1) factor out a method that is testable, but this makes the code less idiomatic;
2) abuse reflection to create a Mapper.Context (not sure about this, but I imagine it's doable), and supply a utility method for this; or
3) individually override each Mapper class in each test and include similar logic to get it done.

Thoughts?

-- David
Re: Hadoop 0.20 and DummyOutputCollector
Sorry, I don't think that's gonna work for tests, unless I misunderstood you. Here's the old interface:

public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter) throws IOException

And the new:

public void map(K key, V value, Context context) throws IOException

The tests historically have called:

map(whatever, whatever2, dummyCollector, null);

There is no obvious analog to a dummyCollector, precisely because Context is a *non-static* class.

-- David

On Wed, Jul 22, 2009 at 1:49 PM, Sean Owen sro...@gmail.com wrote:
> Copy-n-paste the old or new code and continue to use it?
>
> On Wed, Jul 22, 2009 at 9:26 PM, David Hall d...@cs.stanford.edu wrote:
>> The Hadoop gods have seen fit to deprecate OutputCollector and replace it
>> with a non-static inner class of Mapper called Context. This complicates
>> several tests, namely: [...]
Re: Hadoop 0.20 and DummyOutputCollector
It's a non-static inner class; you can't construct those outside of the enclosing class, can you? Or is my Java that rusty?

-- David

On Wed, Jul 22, 2009 at 2:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
> I don't quite understand how this complicates testing but leaves this code
> running. Why can these classes not construct a mock Context?
>
> On Wed, Jul 22, 2009 at 1:26 PM, David Hall d...@cs.stanford.edu wrote:
>> These won't work in the new API. As far as I can tell, the only ways to
>> fix this are to either:
>> 1) factor out a method that is testable, but this makes the code less idiomatic
>> 2) abuse reflection to create a Mapper.Context (not sure about this, but I
>> imagine it's doable), and supply a utility method for this
>> 3) individually override each Mapper class in each test and include
>> similar logic to get it done.
>
> --
> Ted Dunning, CTO
> DeepDyve
Re: Hadoop 0.20 and DummyOutputCollector
Mapper.Context extends MapContext. The interface takes Mapper.Context, and IIRC method arguments aren't contravariant in Java.

-- David

On Wed, Jul 22, 2009 at 2:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:
> Got it. Sorry to be dense.
>
> Is it really necessary to abuse reflection? What about this:
> http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/MapContext.html ?
>
> On Wed, Jul 22, 2009 at 2:15 PM, David Hall d...@cs.stanford.edu wrote:
>> There is no obvious analog to a dummyCollector, precisely because Context
>> is a *non-static* class.
>
> --
> Ted Dunning, CTO
> DeepDyve
Re: Hadoop 0.20 and DummyOutputCollector
Oops, they are contravariant. OK, so my proposed best practice is to make Mappers take a MapContext and not a Mapper.Context, and I'll cook up a DummyContext real fast.

-- David

On Wed, Jul 22, 2009 at 2:22 PM, David Hall d...@cs.stanford.edu wrote:
> Mapper.Context extends MapContext. The interface takes Mapper.Context, and
> IIRC method arguments aren't contravariant in Java.
> [...]
Re: Hadoop 0.20 and DummyOutputCollector
Nope, I was right the first time; they're invariant. I forgot the @Override annotation. Also, you can apparently create an inner class by saying:

MyObj.InnerClass ic = myObj.new InnerClass(args);

This doesn't really get us very far, though, without reflection.

-- David

On Wed, Jul 22, 2009 at 2:25 PM, David Hall d...@cs.stanford.edu wrote:
> Oops, they are contravariant. OK, so my proposed best practice is to make
> Mappers take a MapContext and not a Mapper.Context, and I'll cook up a
> DummyContext real fast.
> [...]
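A minimal JDK-only sketch of the invariance point (hypothetical classes, nothing Hadoop-specific): widening the parameter type creates an overload, not an override, which is exactly what the @Override annotation catches:

    class Base {
      void map(String s) { }
    }

    class Derived extends Base {
      // @Override            // uncommenting this makes javac reject the method,
      void map(Object o) { }  // because a broader parameter type only overloads
    }

So a Mapper subclass declaring map(K, V, MapContext) would compile, but the framework's run() loop calls the map(K, V, Mapper.Context) signature, and the MapContext version would silently never run.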
Re: Hadoop 0.20 and DummyOutputCollector
as in JMock? I have 0 experience with JMock, but I'll look into it.

-- David

On Wed, Jul 22, 2009 at 2:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:
> Can you mock the object? (That counts as using reflection and more, but is
> approved.)
>
> On Wed, Jul 22, 2009 at 2:33 PM, David Hall d...@cs.stanford.edu wrote:
>> Also, you can apparently create an inner class by saying
>> MyObj.InnerClass ic = myObj.new InnerClass(args);
>> This doesn't really get us very far though, without reflection.
[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-123:
------------------------------

Attachment: MAHOUT-123.patch

Everything fixed except adding an example. What's the best way to include data with Mahout? I've never had luck autogenerating data for LDA.

> Implement Latent Dirichlet Allocation
> -------------------------------------
>
> Key: MAHOUT-123
> URL: https://issues.apache.org/jira/browse/MAHOUT-123
> [...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Re: Hadoop 0.20 and DummyOutputCollector
EasyMock for the win. Thanks for the suggestion! No response on the hadoop list, but the mock seems like a fine solution to me. In the simplest case you can just make sure that the call to write is called K times, for whatever K, and in the advanced case you can actually capture the outputs. -- David On Wed, Jul 22, 2009 at 4:05 PM, Ted Dunning ted.dunn...@gmail.com wrote: Or EasyMock. These are amazing libraries that actually twiddle byte code in some cases to emulate classes that would otherwise not be constructable. On Wed, Jul 22, 2009 at 2:46 PM, David Hall d...@cs.stanford.edu wrote: as in JMock? I have 0 experience with JMock, but I'll look into it. -- David On Wed, Jul 22, 2009 at 2:42 PM, Ted Dunning ted.dunn...@gmail.com wrote: Can you mock the object? (that counts as using reflection and more, but is approved) On Wed, Jul 22, 2009 at 2:33 PM, David Hall d...@cs.stanford.edu wrote: Also, you can apparently create an inner class by saying MyObj.InnerClass ic = myObj.new InnerClass(args); This doesn't really get us very far though, without reflection. -- Ted Dunning, CTO DeepDyve
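To make the "called K times" check concrete, here is a minimal sketch against the old org.apache.hadoop.mapred API; WordCountMapper is a hypothetical mapper under test, and EasyMock's Capture class can replace anyObject() for the advanced case where you want to inspect the emitted pairs:
{code}
import static org.easymock.EasyMock.anyObject;
import static org.easymock.EasyMock.createMock;
import static org.easymock.EasyMock.expectLastCall;
import static org.easymock.EasyMock.replay;
import static org.easymock.EasyMock.verify;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapperMockTest {
  @SuppressWarnings("unchecked")
  public void testEmitsThreePairs() throws Exception {
    // Mock the collector instead of hand-rolling a DummyOutputCollector.
    OutputCollector<Text, IntWritable> out = createMock(OutputCollector.class);

    // Simplest case: assert that collect() is invoked exactly three times.
    out.collect((Text) anyObject(), (IntWritable) anyObject());
    expectLastCall().times(3);
    replay(out);

    // WordCountMapper is hypothetical; any old-API Mapper works the same way.
    new WordCountMapper().map(new LongWritable(0), new Text("a b c"), out, Reporter.NULL);

    verify(out); // fails if collect() was called more or fewer than 3 times
  }
}
{code}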
Re: Hadoop 0.20 and DummyOutputCollector
Oh, they did propose a solution, slated for inclusion in 0.21.0 http://issues.apache.org/jira/browse/hadoop-5518 http://www.cloudera.com/hadoop-mrunit However, these use the deprecated APIs... I think EasyMock might be better here. -- David On Wed, Jul 22, 2009 at 5:24 PM, David Hall d...@cs.stanford.edu wrote: EasyMock for the win. Thanks for the suggestion! No response on the hadoop list, but the mock seems like a fine solution to me. In the simplest case you can just make sure that the call to write is called K times, for whatever K, and in the advanced case you can actually capture the outputs. -- David On Wed, Jul 22, 2009 at 4:05 PM, Ted Dunning ted.dunn...@gmail.com wrote: Or EasyMock. These are amazing libraries that actually twiddle byte code in some cases to emulate classes that would otherwise not be constructable. On Wed, Jul 22, 2009 at 2:46 PM, David Hall d...@cs.stanford.edu wrote: as in JMock? I have 0 experience with JMock, but I'll look into it. -- David On Wed, Jul 22, 2009 at 2:42 PM, Ted Dunning ted.dunn...@gmail.com wrote: Can you mock the object? (that counts as using reflection and more, but is approved) On Wed, Jul 22, 2009 at 2:33 PM, David Hall d...@cs.stanford.edu wrote: Also, you can apparently create an inner class by saying MyObj.InnerClass ic = myObj.new InnerClass(args); This doesn't really get us very far though, without reflection. -- Ted Dunning, CTO DeepDyve
Re: [jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation
It's more as an example than a test. I autogenerate data for the tests, which are there for sanity. -- David On Wed, Jul 22, 2009 at 6:26 PM, Ted Dunning ted.dunn...@gmail.com wrote: In a Maven standard build, test data is often included under src/test/resources On Wed, Jul 22, 2009 at 5:23 PM, David Hall (JIRA) j...@apache.org wrote: What's the best way to include data with Mahout? I've never had luck autogenerating data for LDA. -- Ted Dunning, CTO DeepDyve
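A sketch of how bundled example data would then be read back; the path lda/sample-docs.txt is hypothetical, but anything under src/test/resources lands on the test classpath:
{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class LdaExampleDataTest {
  // No absolute paths: the data is resolved from the classpath, so the
  // example works the same from Maven, an IDE, or a packaged test jar.
  private BufferedReader openSampleDocs() throws IOException {
    return new BufferedReader(new InputStreamReader(
        getClass().getResourceAsStream("/lda/sample-docs.txt")));
  }
}
{code}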
[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall updated MAHOUT-123: -- Attachment: MAHOUT-123.patch Ok, here's the updates for the vectors. I'll add a page to the wiki shortly. As for testing, this is actually something I'd like some direction on. It's never been clear to me how to test the actual implementation of clustering algorithms in any meaningful way. Looking at the Dirichlet clusterer, all it tests is that serialization works, that things aren't null, and that it outputs the right number of things. Serialization in this case doesn't seem terribly necessary since my models are just serialized Writables. So... I should just add some basic sanity checks? -- David Implement Latent Dirichlet Allocation - Key: MAHOUT-123 URL: https://issues.apache.org/jira/browse/MAHOUT-123 Project: Mahout Issue Type: New Feature Components: Clustering Affects Versions: 0.2 Reporter: David Hall Assignee: Grant Ingersoll Fix For: 0.2 Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch Original Estimate: 504h Remaining Estimate: 504h (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall updated MAHOUT-126: -- Attachment: MAHOUT-123.patch Ok, I'm going to call this a mostly functional patch. Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Affects Versions: 0.2 Reporter: Shashikant Kore Assignee: Grant Ingersoll Fix For: 0.2 Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch Clustering algorithms presently take the document vectors as input. Generating these document vectors from the text can be broken into two tasks: 1. Create a Lucene index of the input plain-text documents. 2. From the index, generate the (sparse) document vectors with weights as the TF-IDF values of the terms. With a Lucene index, this value can be calculated very easily. Presently, I have created two separate utilities, which could possibly be invoked from another class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall updated MAHOUT-126: -- Attachment: (was: MAHOUT-123.patch) Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Affects Versions: 0.2 Reporter: Shashikant Kore Assignee: Grant Ingersoll Fix For: 0.2 Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Updated: (MAHOUT-126) Prepare document vectors from the text
Ignore this. Wrong issue. On Fri, Jun 19, 2009 at 12:59 AM, David Hall (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall updated MAHOUT-126: -- Attachment: MAHOUT-123.patch Ok, I'm going to call this a mostly functional patch. (Quoted issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721346#action_12721346 ] David Hall commented on MAHOUT-126: --- That's not the only time. This constructor clearly lets certain things slip through. {code}
public CachedTermInfo(IndexReader reader, String field, int minDf, int maxDfPercent) throws IOException {
  this.field = field;
  TermEnum te = reader.terms(new Term(field, ""));
  int count = 0;
  int numDocs = reader.numDocs();
  double percent = numDocs * maxDfPercent / 100.0;
  // Should we use a linked hash map so that we know terms are in order?
  termEntries = new LinkedHashMap<String, TermEntry>();
  do {
    Term term = te.term();
    if (term == null || term.field().equals(field) == false) {
      break;
    }
    int df = te.docFreq();
    if (df < minDf || df > percent) {
      continue;
    }
    TermEntry entry = new TermEntry(term.text(), count++, df);
    termEntries.put(entry.term, entry);
  } while (te.next());
  te.close();
{code} My code is essentially Lucene's demo indexing code (IndexFiles.java and FileDocument.java: http://google.com/codesearch/p?hl=ensa=Ncd=1ct=rc#uGhWbO8eR20/trunk/src/demo/org/apache/lucene/demo/FileDocument.javaq=org.apache.lucene.demo.IndexFiles ), except that I replaced {code}doc.add(new Field("contents", new FileReader(f)));{code} with {code}doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES));{code} I then ran {code}java -cp classpath org.apache.lucene.demo.IndexFiles /Users/dlwh/txt-reuters/{code} and then {code}java -cp classpath org.apache.mahout.utils.vectors.Driver --dir /Users/dlwh/src/lucene/index/ --output ~/src/vec-reuters -f contents -t /Users/dlwh/dict --weight TF{code} For what it's worth, it gives a null on reuters, which is not usually a stop word, except that every single document ends with it, and so the IDF filtering above is catching it. Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Affects Versions: 0.2 Reporter: Shashikant Kore Assignee: Grant Ingersoll Fix For: 0.2 Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: MAHOUT-65
oh, wow, nevermind. Vector implements Writable. Sorry everyone. -- David On Thu, Jun 18, 2009 at 12:19 PM, David Hall d...@cs.stanford.edu wrote: actually, it looks like someone went to all the trouble to make both SparseVector and DenseVector have all the methods required by Writable, but they don't implement Writable. Could I just make Vector extend Writable? -- David On Thu, Jun 18, 2009 at 12:01 PM, David Hall d...@cs.stanford.edu wrote: following up on my earlier email. Would anyone be interested in a compressed serialization for DenseVector/SparseVector that follows in the vein of hadoop.io.Writable? The space overhead for Gson (parsing issues notwithstanding) is pretty high, and it wouldn't be terribly hard to implement a high-performance thing for vectors. -- David On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman j...@windwardsolutions.com wrote: +1, you added name constructors that I didn't have and the equals/equivalent stuff. Ya, Gson makes it all pretty trivial once you grok it. Grant Ingersoll wrote: Shall I take that as approval of the approach? BTW, the Gson stuff seems like a winner for serialization. On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote: You gonna commit your patch? I agree with shortening the class name in the JsonVectorAdapter and will do it once you commit ur stuff. Jeff
Re: MAHOUT-65
How often does Mahout need the Comparable part for Vectors? Are vectors commonly used as map output keys? In terms of space efficiency, I'd bet it's probably a bit better than a factor of two in the average case, especially for DenseVectors. The Gson format is storing both the int index and the double as raw strings, plus whatever boundary characters. The Writable implementation stores just the bytes of the double, plus a length. -- David On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman j...@windwardsolutions.com wrote: +1 asWritableComparable is a simple implementation that uses asFormatString. It would be good to rewrite it for internal communication. A factor of two is still a factor of two. Jeff Grant Ingersoll wrote: On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: Writable should be plenty! +1. Still nice to have JSON for user facing though. On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu wrote: See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it.
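To put a number on that, here is a bare-bones sketch of the kind of binary layout being discussed (illustrative only, not Mahout's actual SparseVector code): an entry count followed by fixed-width (int index, double value) pairs, with no text delimiters at all.
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TinySparseVector implements Writable {
  private int[] indices = new int[0];
  private double[] values = new double[0];

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(indices.length);       // 4-byte entry count
    for (int i = 0; i < indices.length; i++) {
      out.writeInt(indices[i]);         // 4 bytes per index
      out.writeDouble(values[i]);       // 8 bytes per value, vs. ~20+ as text
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int n = in.readInt();
    indices = new int[n];
    values = new double[n];
    for (int i = 0; i < n; i++) {
      indices[i] = in.readInt();
      values[i] = in.readDouble();
    }
  }
}
{code}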
Re: MAHOUT-65
On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman j...@windwardsolutions.com wrote: Shall I change the method to asWritable()? I'd just be for getting rid of it. Vector implements Writable, so asWritable() could just be return this;, which seems gratuitous. As for actual efficiency: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java is currently dumping output values as text strings. If there's a standard dataset, that would be an easy place to do the test. - David I don't know of any situations where Vectors are used as keys. It hardly makes sense to use them as they are so unwieldy. Suggest we could change to just Writable and be ahead. In terms of the potential density improvement, it will be interesting to see what can typically be achieved. r786323 just removed all calls to asWritableComparable, replacing them with asFormatString, which was correct anyway. Jeff David Hall wrote: How often does Mahout need the Comparable part for Vectors? Are vectors commonly used as map output keys? In terms of space efficiency, I'd bet it's probably a bit better than a factor of two in the average case, especially for DenseVectors. The Gson format is storing both the int index and the double as raw strings, plus whatever boundary characters. The Writable implementation stores just the bytes of the double, plus a length. -- David On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman j...@windwardsolutions.com wrote: +1 asWritableComparable is a simple implementation that uses asFormatString. It would be good to rewrite it for internal communication. A factor of two is still a factor of two. Jeff Grant Ingersoll wrote: On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: Writable should be plenty! +1. Still nice to have JSON for user facing though. On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu wrote: See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it.
Who runs the github clone of Mahout?
I'd like to request it be updated to svn's head... Thanks, David
Re: Who runs the github clone of Mahout?
http://github.com/apache/mahout/ Researching it a little, it seems to be run by some kind of auto-mirroring. A lot of (all? most?) Apache projects are there, but none of them have been updated since June 10. -- David On Wed, Jun 17, 2009 at 10:53 AM, Grant Ingersoll gsing...@apache.org wrote: There's a github clone of Mahout? On Jun 17, 2009, at 1:20 PM, David Hall wrote: I'd like to request it be updated to svn's head... Thanks, David
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720816#action_12720816 ] David Hall commented on MAHOUT-126: --- LuceneIteratable (is that an intentional pun?) has behavior that isn't documented well. Namely, if the norm-less constructor is called, the norm defaults to 2. This has the consequence that not passing a norm to Driver L2-normalizes the vectors. You have to specify a negative double != -1.0 to get unnormalized counts. Relatedly, -1 maps to the L2 norm. This is odd behavior to me, or it should at least be documented. (The wiki article implies there's a difference between using --norm 2 and using no norm at all.) Also, I'd like an option to tell Driver what weight object to use. I can do the patch for this. Thanks! Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Affects Versions: 0.2 Reporter: Shashikant Kore Assignee: Grant Ingersoll Fix For: 0.2 Attachments: mahout-126-benson.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
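To spell out the behavior being described, a hypothetical reconstruction of the dispatch (this is not the actual LuceneIteratable source, and it assumes Vector.normalize(double) is Mahout's p-norm scaling):
{code}
import org.apache.mahout.matrix.Vector;

public final class NormDispatchSketch {
  // -1 is a magic default that silently means "L2", so the only way to get
  // raw, unnormalized term counts is to pass a negative power other than -1.
  static Vector maybeNormalize(Vector v, double normPower) {
    if (normPower == -1.0) {
      normPower = 2.0;
    }
    return normPower >= 0.0 ? v.normalize(normPower) : v;
  }
}
{code}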
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721068#action_12721068 ] David Hall commented on MAHOUT-126: --- Ok, I'm probably misunderstanding something, or there could be a bug. I modified Lucene's demo indexer to store a term vector. It's still crashing. I added a series of printlns before TermVector.java:65 and CachedTermInfo:71, and I end up with the assertion here failing: {code}
@Override
public TermEntry getTermEntry(String field, String term) {
  if (this.field.equals(field) == false) {
    return null;
  }
  TermEntry ret = termEntries.get(term);
  assert(ret != null); // This assertion is firing.
  return ret;
}
{code} In my dataset, this happens after several hundred iterations. The term is a stop word for the corpus in question, and it looks like there's an attempt at stopwording earlier in the file. Maybe these are not interacting well? -- David Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Affects Versions: 0.2 Reporter: Shashikant Kore Assignee: Grant Ingersoll Fix For: 0.2 Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall updated MAHOUT-126: -- Attachment: MAHOUT-126-null-entry.patch I'm going to assume that's the problem. The attached patch just skips over any null term vectors. It seems like reasonable behavior here, given the filtering. Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Affects Versions: 0.2 Reporter: Shashikant Kore Assignee: Grant Ingersoll Fix For: 0.2 Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall updated MAHOUT-123: -- Attachment: lda.patch Implement Latent Dirichlet Allocation - Key: MAHOUT-123 URL: https://issues.apache.org/jira/browse/MAHOUT-123 Project: Mahout Issue Type: New Feature Components: Clustering Affects Versions: 0.2 Reporter: David Hall Assignee: Grant Ingersoll Fix For: 0.2 Attachments: lda.patch Original Estimate: 504h Remaining Estimate: 504h (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall updated MAHOUT-123: -- Fix Version/s: 0.2 Affects Version/s: 0.2 Status: Patch Available (was: Open) This is a rough-cut implementation. Not ready to go yet. I've been waiting on MAHOUT-126 because it seems like the way to create the Vectors I need. Or perhaps there's a better way. The basic approach follows the Dirichlet implementation. There is a driver class (LDADriver) which runs K MapReduce iterations, plus a Mapper and a Reducer. We also have an Inferencer, which is what the Mapper uses to compute expected sufficient statistics. A document is just a V-dimensional sparse vector of word counts. Map: perform inference on each document (~ E-step) and output log probabilities of p(word|topic). Reduce: logSum the input log probabilities (~ M-step), and output the result. Loop: use the results of the reduce as the log probabilities for the map. Remaining: 1) Actually run the thing. 2) Number of non-zero elements in a sparse vector: is that staying "size"? 3) Allow for computing the likelihood to determine when we're done. 4) What's the status of serializing as a sparse vector and reading as a dense vector? Is that going to happen? 5) Find a fun data set to bundle. 6) Convenience method for running just inference on a set of documents and outputting MAP estimates of word probabilities. Implement Latent Dirichlet Allocation - Key: MAHOUT-123 URL: https://issues.apache.org/jira/browse/MAHOUT-123 Project: Mahout Issue Type: New Feature Components: Clustering Affects Versions: 0.2 Reporter: David Hall Assignee: Grant Ingersoll Fix For: 0.2 Attachments: lda.patch Original Estimate: 504h Remaining Estimate: 504h (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
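Not spelled out in the patch notes above, but the logSum step in the reducer is presumably the standard log-sum-exp combination; a minimal sketch:
{code}
// Combines two log-domain values as log(exp(a) + exp(b)) without underflow,
// the operation a reducer would fold over per-document log p(word|topic).
public static double logSum(double a, double b) {
  if (a == Double.NEGATIVE_INFINITY) {
    return b;
  }
  if (b == Double.NEGATIVE_INFINITY) {
    return a;
  }
  double max = Math.max(a, b);
  return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
}
{code}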
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714362#action_12714362 ] David Hall commented on MAHOUT-126: --- Sure, I just want to be able to have: double weight = similarity.tf(termFreq) * similarity.idf(docFreq, numDocs); be this instead: double weight = termFreq based on some configuration or another. (Maybe I could just pass in a custom Similarity object? Or there could be a protected createSimilarity method that I could override?) Basically, LDA wants raw counts (or at least, some kind of integers). Thanks! Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Reporter: Shashikant Kore Attachments: MAHOUT-126.patch (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
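The "custom Similarity" option would look roughly like this against the Lucene 2.x API (a sketch, not an existing Mahout class): with tf(f) = f and idf(...) = 1, the existing expression similarity.tf(termFreq) * similarity.idf(docFreq, numDocs) collapses to the raw term frequency LDA wants.
{code}
import org.apache.lucene.search.DefaultSimilarity;

public class RawTermFreqSimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    return freq;    // identity: no sublinear damping of the term frequency
  }

  @Override
  public float idf(int docFreq, int numDocs) {
    return 1.0f;    // neutralize the IDF factor entirely
  }
}
{code}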
Re: digamma function now in commons math
Sounds good. Thanks for taking the time to do that! Should I add a dependency on math 2.0-SNAPSHOT then? It seems unlikely to cause any problems, except in my own code. On Mon, May 25, 2009 at 11:00 AM, Ted Dunning ted.dunn...@gmail.com wrote: David, The commons math team accepted my patch and have committed it to trunk. The version that they have includes better test cases and a trigamma function. This patch will be part of the 2.0 release, which is still some time away. -- Ted Dunning, CTO DeepDyve
[jira] Created: (MAHOUT-123) Implement Latent Dirichlet Allocation
Implement Latent Dirichlet Allocation - Key: MAHOUT-123 URL: https://issues.apache.org/jira/browse/MAHOUT-123 Project: Mahout Issue Type: New Feature Components: Clustering Reporter: David Hall (For GSoC) (Issue description as above.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Special functions
Hi, For my project, I need to have an impl of the digamma function: http://en.wikipedia.org/wiki/Digamma_function Apache commons math doesn't have it (oddly), so I need to acquire it from somewhere else. I trust Radford Neal, who wrote the implementation here: http://google.com/codesearch/p?hl=en#EbB356_xxkI/fbm.2003-06-29/util/digamma.c The license seems more than permissive enough... Alternatively, I can try to track down a book (Numerical Recipes?) with pseudocode. -- David
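For reference, a minimal sketch of the standard approach for positive x (a shift recurrence followed by the asymptotic series); this is a generic textbook construction, not Neal's code:
{code}
// psi(x) = psi(x + 1) - 1/x pushes the argument above ~6, where the
// asymptotic expansion
//   psi(x) ~ ln(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
// is accurate. Valid for x > 0.
public static double digamma(double x) {
  double result = 0.0;
  while (x < 6.0) {
    result -= 1.0 / x;
    x += 1.0;
  }
  double inv2 = 1.0 / (x * x);
  return result + Math.log(x) - 0.5 / x
      - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0));
}
{code}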
Re: Special functions
Share is too strong. He released a number of functions in the library I linked to, and the only requirement of the license seems to be we maintain the copyright notice and say what we changed: /* Copyright (c) 1995-2003 by Radford M. Neal * * Permission is granted for anyone to copy, use, modify, or distribute this * program and accompanying programs and documents for any purpose, provided * this copyright notice is retained and prominently displayed, along with * a note saying that the original programs are available from Radford Neal's * web page, and note is made of any changes made to the programs. The * programs and documents are distributed without any warranty, express or * implied. As the programs were written for research purposes only, they have * not been tested to the degree that would be advisable in any important * application. All use of these programs is entirely at the user's own risk. */ I can also just email him directly. -- David On Sat, May 23, 2009 at 2:22 PM, Ted Dunning ted.dunn...@gmail.com wrote: Avoid Numerical Recipes if you want to avoid license issues. Their publisher has a strong history of being very strict about their interpretation of what they think they own. If Radford Neal has an implementation that he would share, I would count that as a great contribution. On Sat, May 23, 2009 at 2:09 PM, David Hall d...@cs.stanford.edu wrote: Alternatively, I can try to track down a book (Numerical Recipes?) with pseudocode. -- Ted Dunning, CTO DeepDyve
Re: Special functions
Relatedly, I need an implementation of logGamma, which is available in apache commons math. Can I add a dependency? -- David On Sat, May 23, 2009 at 2:26 PM, David Hall d...@cs.stanford.edu wrote: Share is too strong. He released a number of functions in the library I linked to, and the only requirement of the license seems to be we maintain the copyright notice and say what we changed: /* Copyright (c) 1995-2003 by Radford M. Neal * * Permission is granted for anyone to copy, use, modify, or distribute this * program and accompanying programs and documents for any purpose, provided * this copyright notice is retained and prominently displayed, along with * a note saying that the original programs are available from Radford Neal's * web page, and note is made of any changes made to the programs. The * programs and documents are distributed without any warranty, express or * implied. As the programs were written for research purposes only, they have * not been tested to the degree that would be advisable in any important * application. All use of these programs is entirely at the user's own risk. */ I can also just email him directly. -- David On Sat, May 23, 2009 at 2:22 PM, Ted Dunning ted.dunn...@gmail.com wrote: Avoid Numerical Recipes if you want to avoid license issues. Their publisher has a strong history of being very strict about their interpretation of what they think they own. If Radford Neal has an implementation that he would share, I would count that as a great contribution. On Sat, May 23, 2009 at 2:09 PM, David Hall d...@cs.stanford.edu wrote: Alternatively, I can try to track down a book (Numerical Recipes?) with pseudocode. -- Ted Dunning, CTO DeepDyve
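For what it's worth, once the dependency is in place the call is just org.apache.commons.math.special.Gamma.logGamma; a quick sanity check:
{code}
import org.apache.commons.math.special.Gamma;

public class LogGammaCheck {
  public static void main(String[] args) {
    // logGamma(5) = log(4!) = log(24) ~ 3.178
    System.out.println(Gamma.logGamma(5.0));
  }
}
{code}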
Re: Special functions
Thanks! -- David On Sat, May 23, 2009 at 4:44 PM, Ted Dunning ted.dunn...@gmail.com wrote: David, Actually, I just looked around and didn't see much interesting and cleanly available along this line so I just wrote a digamma function. See https://issues.apache.org/jira/browse/MATH-267 for a tar file containing an implementation with test cases. I went ahead and copyrighted this for apache use. It contains source, comments and test values derived from mathematica. In the process, I discovered that the R implementation of digamma is really crappy for medium small positive values of x. On Sat, May 23, 2009 at 2:26 PM, David Hall d...@cs.stanford.edu wrote: Share is too strong. He released a number of functions in the library I linked to, and the only requirement of the license seems to be we maintain the copyright notice and say what we changed: /* Copyright (c) 1995-2003 by Radford M. Neal * * Permission is granted for anyone to copy, use, modify, or distribute this * program and accompanying programs and documents for any purpose, provided * this copyright notice is retained and prominently displayed, along with * a note saying that the original programs are available from Radford Neal's * web page, and note is made of any changes made to the programs. The * programs and documents are distributed without any warranty, express or * implied. As the programs were written for research purposes only, they have * not been tested to the degree that would be advisable in any important * application. All use of these programs is entirely at the user's own risk. */ I can also just email him directly. -- David On Sat, May 23, 2009 at 2:22 PM, Ted Dunning ted.dunn...@gmail.com wrote: Avoid Numerical Recipes if you want to avoid license issues. Their publisher has a strong history of being very strict about their interpretation of what they think they own. If Radford Neal has an implementation that he would share, I would count that as a great contribution. On Sat, May 23, 2009 at 2:09 PM, David Hall d...@cs.stanford.edu wrote: Alternatively, I can try to track down a book (Numerical Recipes?) with pseudocode. -- Ted Dunning, CTO DeepDyve -- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 http://www.deepdyve.com 858-414-0013 (m) 408-773-0220 (fax)
Re: neural network
Map-Reduce for Machine Learning on Multicore by: Cheng T Chu, Sang K Kim, Yi A Lin, Yuanyuan Yu, Gary R Bradski, Andrew Y Ng, Kunle Olukotun edited by: Bernhard Schölkopf, John C Platt, Thomas Hoffman http://www.citeulike.org/user/zzztimbo/article/2308503 On Thu, May 7, 2009 at 1:42 PM, Danny-Michael Busch da...@kurbel.net wrote: Ted Dunning wrote: I don't think that anybody has done any serious work on this yet. I was starting here: http://cwiki.apache.org/MAHOUT/neural-network.html However, I could not find the mentioned paper on the nips.cc website - could anyone give me a hint where this paper could be found? Thanks, Danny -- KURBEL Softwareentwicklung IT - Beratung Danny-Michael Busch Wilhelmstraße 2 D-35392 Gießen Tel.: (01520) 849 8469 http://www.kurbel.net
Re: [GSOC] Accepted Students
Thanks everyone! -- David On Thu, Apr 23, 2009 at 12:53 PM, Grant Ingersoll gsing...@apache.org wrote: It's also helpful to get yourself a Wiki account and a JIRA account if you don't already have them. Small patches to the existing docs/code can also help you figure out the process. On Apr 21, 2009, at 1:19 PM, Isabel Drost wrote: On Tuesday 21 April 2009 08:30:34 David Hall wrote: As for questions, what am I supposed to be reading during this community building period? I see: * http://cwiki.apache.org/MAHOUT/howtocontribute.html * http://www.apache.org/foundation/how-it-works.html plus skimming javadocs. These are certainly of interest. In addition you can check out and have a look at the code. Try to get a rough idea of where your contribution would fit best. Please share your ideas with the community to get feedback early on.
Re: Introduction for student interested in GSoC
Here's a followup proposal (submitted to GSoC's site. I will add it to the wiki, but I'm having trouble accessing it right now) Thanks! -- David Title/Summary: Distributed Latent Dirichlet Allocation Student: David Hall Student e-mail: d...@cs.stanford.edu Student Major: Symbolic Systems/ Computer Science Student Degree: MS/PhD Student Graduation: Stanford '09 / Berkeley '14 Organization: Hadoop Assigned Mentor: Grant Ingersoll Abstract: Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning algorithm for automatically and jointly clustering words into topics and documents into mixtures of topics, and it has been successfully applied to model change in scientific fields over time (Griffiths and Steyvers, 2004; Hall, et al. 2008). In this project, I propose to implement a distributed variant of Latent Dirichlet Allocation using MapReduce, and, time permitting, to investigate extensions of LDA and possibly more efficient algorithms for distributed inference. Detailed Description: A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over topics, which are in turn distributions over words. For instance, a topic in a collection of newswire might include words about sports, such as baseball, home run, player, and a document about steroid use in baseball might include sports, drugs, and politics. Note that the labels sports, drugs, and politics are post-hoc labels assigned by a human, and that the algorithm itself only associates words with probabilities. The task of parameter estimation in these models is to learn both what these topics are, and which documents employ them in what proportions. One of the promises of unsupervised learning algorithms like Latent Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take massive collections of documents and condense them down into a collection of easily understandable topics. However, all available open source implementations of LDA and related topic models are not distributed, which hampers their utility. This project seeks to correct this shortcoming. In the literature, there have been several proposals for parallelizing LDA. Newman, et al (2007) proposed to create an approximate LDA in which each processor gets its own subset of the documents to run Gibbs sampling over. However, Gibbs sampling is slow and stochastic by its very nature, which is not advantageous for repeated runs. Instead, I propose to follow Nallapati, et al. (2007) and use a variational approximation that is fast and non-random. From there, I would like to extend LDA to either Supervised Topic Models (Blei and McAuliffe, 2007) or Topics over Time (Wang and McCallum, 2006). The former would enable LDA to be used for basic classification tasks, and the latter would model topic dynamics explicitly. For instance, a sports topic is not static: certain words are more important over time, and the prominence of sports itself rises and falls around playoff schedules and the like. A basic implementation of either of these would be a reasonably straightforward extension to LDA, and they would also prove the flexibility of my implementation. Finally, time permitting, I would like to examine the more efficient algorithms using junction trees proposed by Wolfe, et al. (2008). They demonstrate substantial speedups over the naive implementation proposed earlier, but the framework does not fit as easily into the standard map-reduce architecture as implemented by Hadoop.
I anticipate that this work will be more exploratory in nature, with a focus on laying the groundwork for improvements more than a polished implementation. Biography: I am a graduating master's student in Symbolic Systems from Stanford University, and I will begin work on the PhD at UC Berkeley next autumn. I am currently a member of Stanford's Natural Language Processing group under Dan Jurafsky and Chris Manning. (I will be working with Dan Klein in the fall.) My research has involved the application of topic models to modeling and discovering the origins of scientific paradigms, breakthroughs that change a scientific field or even create a new field. More recently, I've worked on minimally unsupervised part-of-speech tagging. In terms of my experience with Hadoop and MapReduce, I have interned at Google for two summers, working with MapReduce for the entirety of both internships. My second summer I worked on Google's Machine Translation team, and so I am familiar with using MapReduce to implement large-scale NLP and Machine Learning algorithms. More recently, I've been in charge of setting up and maintaining a small Hadoop cluster in my research group, and I've written a wrapper library for Hadoop in the Scala Programming Language. The code isn't quite release quality yet, but you can see its work-in-progress state at http://bugs.scalanlp.org/repositories/show/smr , and I've written a short blog post about it at http://scala-blogs.org
Re: Introduction for student interested in GSoC
On Tue, Mar 24, 2009 at 4:15 PM, Ted Dunning ted.dunn...@gmail.com wrote: This sounds fantastic. I think that your Scala code is interesting, but your thoughts on LDA are much more so. I tried doing a similar simplification of map-reduce program writing using Groovy and found that, in spite of even smaller programs than you quote for word-count, the benefits in practice were relatively small. Using Pig was much more productive, even with the lack of any real programming language. Thanks! I agree that SMR isn't there yet, and it really isn't a Mahout thing. I could get closer to the Groovy line count, but my main goal was to remove all the boilerplate associated with Hadoop (Text, IntWritable, Mapper/Reducer) and to get closer to the real program logic. You are right that Pig is usually more useful for many tasks, and one of my plans is to duplicate some of its functionality, though I actually think I prefer Dryad/LINQ's kind of syntax. It would also be interesting to see how you might attack semi-supervised multi-task learning using a well-founded Bayesian approach. For a non-Bayesian example with impressive results, see Ronan Collobert's paper: http://ronan.collobert.com/pub/2008_nlp_icml.html Interesting. I'll take a closer look at this this evening. -- David On Tue, Mar 24, 2009 at 12:26 AM, David Hall d...@cs.stanford.edu wrote: This summer, I'd like to help contribute to the Mahout project. I read Tijs Zwinkels' proposal, and I think that what I would like to work on is sufficiently different from what he would like to do. First, I would like to implement Latent Dirichlet Allocation, a popular topic mixture model that learns both document clusters and word clusters. I would then like to extend it to implement a number of general-purpose topic models, including Topics over Time, Pachinko Allocation, and possibly Supervised Topic Models. -- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 www.deepdyve.com 408-773-0110 ext. 738 858-414-0013 (m) 408-773-0220 (fax)
Re: Introduction for student interested in GSoC
On Tue, Mar 24, 2009 at 4:34 PM, David Hall d...@cs.stanford.edu wrote: On Tue, Mar 24, 2009 at 4:15 PM, Ted Dunning ted.dunn...@gmail.com wrote: It would also be interesting to see how you might attack semi-supervised multi-task learning using a well-founded Bayesian approach. For a non-Bayesian example with impressive results, see Ronan Collobert's paper: http://ronan.collobert.com/pub/2008_nlp_icml.html Interesting. I'll take a closer look at this this evening. Actually, my officemate's dissertation project is very closely related to this, except using parsing as a base. That is to say, I probably shouldn't work on it, because I'd be stepping on her toes... -- David