Re: [jira] Commented: (MAHOUT-135) Allow FileDataModel to transpose users and items
Transposing is actually a common need as you abstract away from users and ratings. On Thu, Jun 18, 2009 at 10:19 PM, Sean Owen (JIRA) wrote: > Looks OK to me -- I applied the patch locally and tweaked a few things. > Seems like a rare use case but simple to implement anyway. Mind if I submit > over here? > > > Allow FileDataModel to transpose users and items > >
[jira] Commented: (MAHOUT-135) Allow FileDataModel to transpose users and items
[ https://issues.apache.org/jira/browse/MAHOUT-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721653#action_12721653 ] Sean Owen commented on MAHOUT-135: -- Looks OK to me -- I applied the patch locally and tweaked a few things. Seems like a rare use case but simple to implement anyway. Mind if I submit over here? > Allow FileDataModel to transpose users and items > > > Key: MAHOUT-135 > URL: https://issues.apache.org/jira/browse/MAHOUT-135 > Project: Mahout > Issue Type: Improvement > Components: Collaborative Filtering >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-135.patch > > > Sometimes it would be nice to flip around users and items in the > FileDataModel. This patch adds a transpose boolean that flips userId and > itemId in the processLine method. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [GSOC] Thoughts about Random forests map-reduce implementation
Very similar, but I was talking about building trees on each split of the data (a la map reduce split). That would give many small splits and would thus give very different results from bagging because the splits would be small and contiguous rather than large and random. On Thu, Jun 18, 2009 at 1:37 AM, deneche abdelhakim wrote: > "build multiple trees for different portions of the data" > > What's the difference with the basic bagging algorithm, which builds 'each > tree' using a different portion (about 2/3) of the data ?
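To make the distinction concrete, here is an illustrative sketch (not Mahout code; all class and method names are hypothetical) of the two ways of choosing a tree's training subset: bagging draws a random bootstrap sample of the whole dataset, while the map-reduce variant hands each tree one small contiguous split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: contrasts the two subset-selection schemes discussed
// above. Bagging samples uniformly at random with replacement from the whole
// dataset; a map-reduce split is a contiguous block of the input.
public class SubsetChoice {

  /** Bagging: n indices drawn uniformly at random with replacement. */
  public static List<Integer> bootstrapSample(int n, long seed) {
    Random rng = new Random(seed);
    List<Integer> sample = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      sample.add(rng.nextInt(n));
    }
    return sample;
  }

  /** Map-reduce style: the k-th of m contiguous splits of [0, n). */
  public static List<Integer> contiguousSplit(int n, int m, int k) {
    int splitSize = (n + m - 1) / m;  // ceiling division
    List<Integer> split = new ArrayList<>();
    for (int i = k * splitSize; i < Math.min(n, (k + 1) * splitSize); i++) {
      split.add(i);
    }
    return split;
  }
}
```

If the input is sorted or otherwise non-random, the contiguous splits are exactly what makes the result differ from bagging: each tree sees a biased slice rather than a random cross-section.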
[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721646#action_12721646 ] Sean Owen commented on MAHOUT-121: -- Since I am not hearing objections, and cognizant that people are waiting on this, going to commit. If there are issues we can roll back or tweak from there. > Speed up distance calculations for sparse vectors > - > > Key: MAHOUT-121 > URL: https://issues.apache.org/jira/browse/MAHOUT-121 > Project: Mahout > Issue Type: Improvement > Components: Matrix >Reporter: Shashikant Kore > Attachments: MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, > MAHOUT-121.patch, MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, > Mahout1211.patch > > > From my mail to the Mahout mailing list. > I am working on clustering a dataset which has thousands of sparse vectors. > The complete dataset has a few tens of thousands of feature items but each > vector has only a couple of hundred feature items. For this, there is an > optimization in distance calculation, a link to which I found in the archives > of the Mahout mailing list. > http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/ > I tried out this optimization. The test setup had 2000 document vectors > with a few hundred items. I ran canopy generation with Euclidean distance and > t1, t2 values as 250 and 200. > > Current Canopy Generation: 28 min 15 sec. > Canopy Generation with distance optimization: 1 min 38 sec. > I know by experience that using Integer, Double objects instead of primitives > is computationally expensive. I changed the sparse vector implementation to > use primitive collections from Trove [ > http://trove4j.sourceforge.net/ ]. > Distance optimization with Trove: 59 sec > Current canopy generation with Trove: 21 min 55 sec > To sum, these two optimizations reduced cluster generation time by 97%. > Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
> > Licensing of Trove seems to be an issue which needs to be addressed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
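The optimization referenced in the LingPipe post above can be sketched as follows. This is a hedged illustration, not the actual MAHOUT-121 patch: expand ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, precompute the squared norms once, and compute the dot product by touching only the sparse vector's nonzero entries, so cost is proportional to the number of nonzeros rather than the full dimensionality.

```java
import java.util.Map;

// Sketch of the sparse squared-Euclidean-distance trick (illustrative names):
// the sparse vector is a map from index to value; the centroid is dense; both
// squared norms are assumed precomputed (once per vector/centroid).
public class SparseEuclidean {

  /** Cost is O(nnz(x)), not O(dimension). */
  public static double distanceSquared(Map<Integer, Double> x,
                                       double[] y,
                                       double normXSquared,
                                       double normYSquared) {
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : x.entrySet()) {
      dot += e.getValue() * y[e.getKey()];  // only nonzero entries of x
    }
    return normXSquared - 2.0 * dot + normYSquared;
  }
}
```

For canopy or k-means, the centroid norms change far less often than distances are computed, which is where the 28-minutes-to-under-2-minutes speedup reported above comes from.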
[jira] Updated: (MAHOUT-135) Allow FileDataModel to transpose users and items
[ https://issues.apache.org/jira/browse/MAHOUT-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-135: --- Attachment: MAHOUT-135.patch Patch that adds transpose and tests > Allow FileDataModel to transpose users and items > > > Key: MAHOUT-135 > URL: https://issues.apache.org/jira/browse/MAHOUT-135 > Project: Mahout > Issue Type: Improvement > Components: Collaborative Filtering >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-135.patch > > > Sometimes it would be nice to flip around users and items in the > FileDataModel. This patch adds a transpose boolean that flips userId and > itemId in the processLine method. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-135) Allow FileDataModel to transpose users and items
Allow FileDataModel to transpose users and items
------------------------------------------------

                 Key: MAHOUT-135
                 URL: https://issues.apache.org/jira/browse/MAHOUT-135
             Project: Mahout
          Issue Type: Improvement
          Components: Collaborative Filtering
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll
            Priority: Minor
             Fix For: 0.2


Sometimes it would be nice to flip around users and items in the FileDataModel. This patch adds a transpose boolean that flips userId and itemId in the processLine method.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
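The idea behind the patch can be sketched roughly as follows. This is an illustrative, simplified parser, not the actual FileDataModel code; the class and field names here are hypothetical:

```java
// Hypothetical sketch of a transpose flag in a processLine-style method:
// when 'transpose' is set, userID and itemID swap roles while parsing a
// "userID,itemID,preference" line.
public class TransposingLineParser {

  private final boolean transpose;

  public TransposingLineParser(boolean transpose) {
    this.transpose = transpose;
  }

  /** Parses "userID,itemID,preference" and returns {userID, itemID, preference}. */
  public String[] processLine(String line) {
    String[] tokens = line.split(",");
    String userID = tokens[0];
    String itemID = tokens[1];
    if (transpose) {  // flip users and items
      String tmp = userID;
      userID = itemID;
      itemID = tmp;
    }
    return new String[] { userID, itemID, tokens[2] };
  }
}
```

With transpose enabled, a file of userID,itemID,preference lines is effectively read as itemID,userID,preference, which lets the same data file drive item-centric as well as user-centric recommendation.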
Re: MAHOUT-65
Er, um, I see what you mean. How about just deleting the method? What really needs doing then is for all of the various clusters to themselves implement Writable so that they don't need to call asFormatString but can just emit themselves. Jeff Ted Dunning wrote: What does this method do? If the vector already implements Writable, what is the purpose of a conversion? On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman wrote: Shall I change the method to asWritable()? PGP.sig Description: PGP signature
Re: MAHOUT-65
What does this method do? If the vector already implements Writable, what is the purpose of a conversion? On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman wrote: > Shall I change the method to asWritable()? -- Ted Dunning, CTO DeepDyve
Re: MAHOUT-65
On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman wrote: > Shall I change the method to asWritable()? I'd just be for getting rid of it. Vector implements Writable, so asWritable() could just be "return this;", which seems gratuitous. As for actual efficiency: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java is currently dumping output values as text strings. If there's a standard dataset, that would be an easy place to do the test. - David > I don't know of any situations where Vectors are used as keys. It hardly > makes sense to use them as they are so unwieldy. Suggest we could change to > just Writable and be ahead. In terms of the potential density improvement, > it will be interesting to see what can typically be achieved. > > r786323 just removed all calls to asWritableComparable, replacing them with > asFormatString which was correct anyway. > > > Jeff > > David Hall wrote: >> >> How often does Mahout need the "Comparable" part for Vectors? Are >> vectors commonly used as map output keys? >> >> In terms of space efficiency, I'd bet it's probably a bit better than >> a factor of two in the average case, especially for densevectors. The >> gson format is storing both the int index and the double as raw >> strings, plus whatever boundary characters. The writable >> implementation stores just the bytes of the double, plus a length. >> >> -- David >> >> On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman >> wrote: >> >>> >>> +1 asWritableComparable is a simple implementation that uses >>> asFormatString. >>> It would be good to rewrite it for internal communication. A factor of >>> two >>> is still a factor of two. >>> >>> Jeff >>> >>> >>> Grant Ingersoll wrote: On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: > > Writable should be plenty! > > +1. Still nice to have JSON for user facing though. 
> > On Thu, Jun 18, 2009 at 1:15 PM, David Hall > wrote: > > >> >> See my followup on another thread (sorry for the schizophrenic >> posting); Vector already implements Writable, so that's all I really >> can ask of it. Is there something more you'd like? I'd be happy to do >> it. >> >> >> >>> >>> >> >> >> > >
Re: MAHOUT-65
I don't know of any situations where Vectors are used as keys. It hardly makes sense to use them as they are so unwieldy. Suggest we could change to just Writable and be ahead. In terms of the potential density improvement, it will be interesting to see what can typically be achieved. r786323 just removed all calls to asWritableComparable, replacing them with asFormatString which was correct anyway. Shall I change the method to asWritable()? Jeff David Hall wrote: How often does Mahout need the "Comparable" part for Vectors? Are vectors commonly used as map output keys? In terms of space efficiency, I'd bet it's probably a bit better than a factor of two in the average case, especially for densevectors. The gson format is storing both the int index and the double as raw strings, plus whatever boundary characters. The writable implementation stores just the bytes of the double, plus a length. -- David On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman wrote: +1 asWritableComparable is a simple implementation that uses asFormatString. It would be good to rewrite it for internal communication. A factor of two is still a factor of two. Jeff Grant Ingersoll wrote: On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: Writable should be plenty! +1. Still nice to have JSON for user facing though. On Thu, Jun 18, 2009 at 1:15 PM, David Hall wrote: See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it. PGP.sig Description: PGP signature
Re: MAHOUT-65
How often does Mahout need the "Comparable" part for Vectors? Are vectors commonly used as map output keys? In terms of space efficiency, I'd bet it's probably a bit better than a factor of two in the average case, especially for densevectors. The gson format is storing both the int index and the double as raw strings, plus whatever boundary characters. The writable implementation stores just the bytes of the double, plus a length. -- David On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman wrote: > +1 asWritableComparable is a simple implementation that uses asFormatString. > It would be good to rewrite it for internal communication. A factor of two > is still a factor of two. > > Jeff > > > Grant Ingersoll wrote: >> >> On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: >> >>> Writable should be plenty! >>> >> >> +1. Still nice to have JSON for user facing though. >> >>> On Thu, Jun 18, 2009 at 1:15 PM, David Hall wrote: >>> See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it. >> >> >> >> > >
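The size argument above ("just the bytes of the double, plus a length") can be sketched with plain java.io, which uses the same encoding shape as a Hadoop Writable would: a length followed by 8 raw bytes per double. This is an illustration, not Mahout's actual vector serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Back-of-the-envelope illustration of a Writable-style dense vector
// encoding: 4 bytes of cardinality, then 8 raw bytes per element. A JSON
// encoding spells out every digit plus separators, and must be re-parsed.
public class DenseVectorBytes {

  public static byte[] write(double[] v) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(v.length);   // cardinality
    for (double d : v) {
      out.writeDouble(d);     // 8 bytes, no string parsing on the way back
    }
    out.flush();
    return bytes.toByteArray();
  }

  public static double[] read(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    double[] v = new double[in.readInt()];
    for (int i = 0; i < v.length; i++) {
      v[i] = in.readDouble();
    }
    return v;
  }
}
```

A double like 0.123456789012345 costs 8 bytes here versus 17+ characters in JSON, which is where the "better than a factor of two" estimate comes from.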
[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation
[ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall updated MAHOUT-123: -- Attachment: MAHOUT-123.patch (Still in progress.) It seems to work, but it's much too slow because I underestimated the badness of using DenseVectors. Switching to an element-wise system now. > Implement Latent Dirichlet Allocation > - > > Key: MAHOUT-123 > URL: https://issues.apache.org/jira/browse/MAHOUT-123 > Project: Mahout > Issue Type: New Feature > Components: Clustering >Affects Versions: 0.2 >Reporter: David Hall >Assignee: Grant Ingersoll > Fix For: 0.2 > > Attachments: lda.patch, MAHOUT-123.patch > > Original Estimate: 504h > Remaining Estimate: 504h > > (For GSoC) > Abstract: > Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning > algorithm for automatically and jointly clustering words into "topics" > and documents into mixtures of topics, and it has been successfully > applied to model change in scientific fields over time (Griffiths and > Steyvers, 2004; Hall, et al. 2008). In this project, I propose to > implement a distributed variant of Latent Dirichlet Allocation using > MapReduce, and, time permitting, to investigate extensions of LDA and > possibly more efficient algorithms for distributed inference. > Detailed Description: > A topic model is, roughly, a hierarchical Bayesian model that > associates with each document a probability distribution over > "topics", which are in turn distributions over words. For instance, a > topic in a collection of newswire might include words about "sports", > such as "baseball", "home run", "player", and a document about steroid > use in baseball might include "sports", "drugs", and "politics". Note > that the labels "sports", "drugs", and "politics", are post-hoc labels > assigned by a human, and that the algorithm itself only > associates words with probabilities. 
The task of parameter estimation > in these models is to learn both what these topics are, and which > documents employ them in what proportions. > One of the promises of unsupervised learning algorithms like Latent > Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a > massive collection of documents and condense them down into a > collection of easily understandable topics. However, all available > open source implementations of LDA and related topic models are not > distributed, which hampers their utility. This project seeks to > correct this shortcoming. > In the literature, there have been several proposals for parallelizing > LDA. Newman, et al (2007) proposed to create an "approximate" LDA in > which each processor gets its own subset of the documents to run > Gibbs sampling over. However, Gibbs sampling is slow and stochastic by > its very nature, which is not advantageous for repeated runs. Instead, > I propose to follow Nallapati, et al. (2007) and use a variational > approximation that is fast and non-random. > References: > David M. Blei, J. McAuliffe. Supervised Topic Models. NIPS, 2007. > David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet > allocation, The Journal of Machine Learning Research, 3, p.993-1022, > 3/1/2003 > T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl > Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004. > David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying > the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008. > Ramesh Nallapati, William Cohen, John Lafferty, Parallelized > variational EM for Latent Dirichlet Allocation: An experimental > evaluation of speed and scalability, ICDM workshop on high performance > data mining, 2007. > Newman, D., Asuncion, A., Smyth, P., & Welling, M. Distributed > Inference for Latent Dirichlet Allocation. NIPS, 2007. 
> Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov > continuous-time model of topical trends. KDD, 2006 > Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very > large datasets. ICML, 2008. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: MAHOUT-65
+1 asWritableComparable is a simple implementation that uses asFormatString. It would be good to rewrite it for internal communication. A factor of two is still a factor of two. Jeff Grant Ingersoll wrote: On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: Writable should be plenty! +1. Still nice to have JSON for user facing though. On Thu, Jun 18, 2009 at 1:15 PM, David Hall wrote: See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it. PGP.sig Description: PGP signature
Re: MAHOUT-65
On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: Writable should be plenty! +1. Still nice to have JSON for user facing though. On Thu, Jun 18, 2009 at 1:15 PM, David Hall wrote: See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it.
Re: MAHOUT-65
Writable should be plenty! On Thu, Jun 18, 2009 at 1:15 PM, David Hall wrote: > See my followup on another thread (sorry for the schizophrenic > posting); Vector already implements Writable, so that's all I really > can ask of it. Is there something more you'd like? I'd be happy to do > it. > >
Re: MAHOUT-65
See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it. -- David On Thu, Jun 18, 2009 at 1:11 PM, Ted Dunning wrote: > +10!!! > > How would you like to do it? Something like avro? Thrift? Homespun? > > On Thu, Jun 18, 2009 at 12:01 PM, David Hall wrote: > >> Would anyone be interested in a "compressed" serialization for >> DenseVector/SparseVector that follows in the vein of >> hadoop.io.Writable? The space overhead for gson (parsing issues >> not-withstanding) is pretty high, and it wouldn't be terribly hard to >> implement a high-performance thing for vectors. >> >
Re: MAHOUT-65
+10!!! How would you like to do it? Something like avro? Thrift? Homespun? On Thu, Jun 18, 2009 at 12:01 PM, David Hall wrote: > Would anyone be interested in a "compressed" serialization for > DenseVector/SparseVector that follows in the vein of > hadoop.io.Writable? The space overhead for gson (parsing issues > not-withstanding) is pretty high, and it wouldn't be terribly hard to > implement a high-performance thing for vectors. >
Re: MAHOUT-65
oh, wow, nevermind. Vector implements writable. Sorry everyone. -- David On Thu, Jun 18, 2009 at 12:19 PM, David Hall wrote: > actually, it looks like someone went to all the trouble to make both > SparseVector and DenseVector have all the methods required by > Writable, but they don't implement Writable. > > Could I just make Vector extend Writable? > > -- David > > On Thu, Jun 18, 2009 at 12:01 PM, David Hall wrote: >> following up on my earlier email. >> >> Would anyone be interested in a "compressed" serialization for >> DenseVector/SparseVector that follows in the vein of >> hadoop.io.Writable? The space overhead for gson (parsing issues >> not-withstanding) is pretty high, and it wouldn't be terribly hard to >> implement a high-performance thing for vectors. >> >> -- David >> >> On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman >> wrote: >>> +1, you added name constructors that I didn't have and the equals/equivalent >>> stuff. Ya, Gson makes it all pretty trivial once you grok it. >>> >>> >>> Grant Ingersoll wrote: Shall I take that as approval of the approach? BTW, the Gson stuff seems like a winner for serialization. On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote: > You gonna commit your patch? I agree with shortening the class name in > the JsonVectorAdapter and will do it once you commit ur stuff. > Jeff >>> >>> >> >
Re: MAHOUT-65
actually, it looks like someone went to all the trouble to make both SparseVector and DenseVector have all the methods required by Writable, but they don't implement Writable. Could I just make Vector extend Writable? -- David On Thu, Jun 18, 2009 at 12:01 PM, David Hall wrote: > following up on my earlier email. > > Would anyone be interested in a "compressed" serialization for > DenseVector/SparseVector that follows in the vein of > hadoop.io.Writable? The space overhead for gson (parsing issues > not-withstanding) is pretty high, and it wouldn't be terribly hard to > implement a high-performance thing for vectors. > > -- David > > On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman > wrote: >> +1, you added name constructors that I didn't have and the equals/equivalent >> stuff. Ya, Gson makes it all pretty trivial once you grok it. >> >> >> Grant Ingersoll wrote: >>> >>> Shall I take that as approval of the approach? >>> >>> BTW, the Gson stuff seems like a winner for serialization. >>> >>> On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote: >>> You gonna commit your patch? I agree with shortening the class name in the JsonVectorAdapter and will do it once you commit ur stuff. Jeff >>> >>> >>> >>> >> >> >
Re: MAHOUT-65
following up on my earlier email. Would anyone be interested in a "compressed" serialization for DenseVector/SparseVector that follows in the vein of hadoop.io.Writable? The space overhead for gson (parsing issues not-withstanding) is pretty high, and it wouldn't be terribly hard to implement a high-performance thing for vectors. -- David On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman wrote: > +1, you added name constructors that I didn't have and the equals/equivalent > stuff. Ya, Gson makes it all pretty trivial once you grok it. > > > Grant Ingersoll wrote: >> >> Shall I take that as approval of the approach? >> >> BTW, the Gson stuff seems like a winner for serialization. >> >> On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote: >> >>> You gonna commit your patch? I agree with shortening the class name in >>> the JsonVectorAdapter and will do it once you commit ur stuff. >>> Jeff >> >> >> >> > >
GSON stack overflows
GSON's parser is apparently not tail recursive. Opinions? In the meantime, I'm going to consider an alternative implementation that doesn't involve serializing huge vectors. -- David

java.io.IOException: Spill failed
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:573)
	at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:65)
	at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:48)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: com.google.gson.JsonParseException: Failed parsing JSON source: java.io.stringrea...@558964ad to Json
	at com.google.gson.JsonParser.parse(JsonParser.java:59)
	at com.google.gson.Gson.fromJson(Gson.java:376)
	at com.google.gson.Gson.fromJson(Gson.java:329)
	at com.google.gson.Gson.fromJson(Gson.java:305)
	at org.apache.mahout.matrix.JsonVectorAdapter.deserialize(JsonVectorAdapter.java:69)
	at org.apache.mahout.matrix.JsonVectorAdapter.deserialize(JsonVectorAdapter.java:35)
	at com.google.gson.JsonDeserializerExceptionWrapper.deserialize(JsonDeserializerExceptionWrapper.java:50)
	at com.google.gson.JsonDeserializationVisitor.visitUsingCustomHandler(JsonDeserializationVisitor.java:65)
	at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
	at com.google.gson.JsonDeserializationContextDefault.fromJsonObject(JsonDeserializationContextDefault.java:73)
	at com.google.gson.JsonDeserializationContextDefault.deserialize(JsonDeserializationContextDefault.java:49)
	at com.google.gson.Gson.fromJson(Gson.java:379)
	at com.google.gson.Gson.fromJson(Gson.java:329)
	at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:326)
	at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:310)
	at org.apache.mahout.clustering.lda.LDAReducer.reduce(LDAReducer.java:47)
	at org.apache.mahout.clustering.lda.LDAReducer.reduce(LDAReducer.java:40)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:1116)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:989)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:401)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:886)
Caused by: java.lang.StackOverflowError
	at com.google.gson.JsonParserJavacc.jj_3R_4(JsonParserJavacc.java:387)
	at com.google.gson.JsonParserJavacc.jj_3R_3(JsonParserJavacc.java:394)
	at com.google.gson.JsonParserJavacc.jj_3R_1(JsonParserJavacc.java:414)
	at com.google.gson.JsonParserJavacc.jj_3_1(JsonParserJavacc.java:400)
	at com.google.gson.JsonParserJavacc.jj_2_1(JsonParserJavacc.java:381)
	at com.google.gson.JsonParserJavacc.JsonNumber(JsonParserJavacc.java:229)
	at com.google.gson.JsonParserJavacc.JsonValue(JsonParserJavacc.java:166)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:142)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
	(etc)
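The repeated Elements frames in the trace show the failure mode: a recursive-descent grammar rule of the form Elements -> Value "," Elements consumes one JVM stack frame per array element, so deserializing a vector with tens of thousands of entries exceeds the default stack. A toy sketch (this is not GSON's code, just the shape of the recursion) contrasts that with a constant-stack loop:

```java
import java.util.List;

// Toy illustration of per-element recursion vs. iteration. The recursive
// version mirrors the generated javacc parser: one frame per element, and the
// JVM does not perform tail-call elimination, so depth grows with input size.
public class RecursionDepth {

  /** Recursive: one stack frame per element. */
  public static int countRecursive(List<Double> elems, int i) {
    if (i >= elems.size()) {
      return 0;
    }
    return 1 + countRecursive(elems, i + 1);  // not eliminated by the JVM
  }

  /** Iterative: constant stack space regardless of vector length. */
  public static int countIterative(List<Double> elems) {
    int n = 0;
    for (double ignored : elems) {
      n++;
    }
    return n;
  }
}
```

With a default thread stack, the recursive form fails somewhere in the low tens of thousands of elements, which matches a serialized LDA-sized vector.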
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721351#action_12721351 ] Grant Ingersoll commented on MAHOUT-126: Yep, you are right. I committed your patch anyway. We should probably add command-line support for setting minDF and maxDF. > Prepare document vectors from the text > -- > > Key: MAHOUT-126 > URL: https://issues.apache.org/jira/browse/MAHOUT-126 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.2 >Reporter: Shashikant Kore >Assignee: Grant Ingersoll > Fix For: 0.2 > > Attachments: mahout-126-benson.patch, > MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, > MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, > MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch > > > Clustering algorithms presently take the document vectors as input. > Generating these document vectors from the text can be broken in two tasks. > 1. Create lucene index of the input plain-text documents > 2. From the index, generate the document vectors (sparse) with weights as > TF-IDF values of the term. With lucene index, this value can be calculated > very easily. > Presently, I have created two separate utilities, which could possibly be > invoked from another class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721346#action_12721346 ] David Hall commented on MAHOUT-126: --- That's not the only time. This constructor clearly lets certain things slip through.

{code}
public CachedTermInfo(IndexReader reader, String field, int minDf, int maxDfPercent) throws IOException {
  this.field = field;
  TermEnum te = reader.terms(new Term(field, ""));
  int count = 0;
  int numDocs = reader.numDocs();
  double percent = numDocs * maxDfPercent / 100.0;
  //Should we use a linked hash map so that we know terms are in order?
  termEntries = new LinkedHashMap();
  do {
    Term term = te.term();
    if (term == null || term.field().equals(field) == false) {
      break;
    }
    int df = te.docFreq();
    if (df < minDf || df > percent) {
      continue;
    }
    TermEntry entry = new TermEntry(term.text(), count++, df);
    termEntries.put(entry.term, entry);
  } while (te.next());
  te.close();
{code}

My code is essentially Lucene's demo indexing code (IndexFiles.java and FileDocument.java: http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#uGhWbO8eR20/trunk/src/demo/org/apache/lucene/demo/FileDocument.java&q=org.apache.lucene.demo.IndexFiles) except that I replaced

{code}doc.add(new Field("contents", new FileReader(f)));{code}

with

{code}doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES));{code}

I then ran

{code}java -cp org.apache.lucene.demo.IndexFiles /Users/dlwh/txt-reuters/{code}

and then

{code}java -cp org.apache.mahout.utils.vectors.Driver --dir /Users/dlwh/src/lucene/index/ --output ~/src/vec-reuters -f contents -t /Users/dlwh/dict --weight TF{code}

For what it's worth, it gives a null on "reuters", which is not usually a stop word, except that every single document ends with it, and so the DF filtering above is catching it. 
> Prepare document vectors from the text > -- > > Key: MAHOUT-126 > URL: https://issues.apache.org/jira/browse/MAHOUT-126 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.2 >Reporter: Shashikant Kore >Assignee: Grant Ingersoll > Fix For: 0.2 > > Attachments: mahout-126-benson.patch, > MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, > MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, > MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch > > > Clustering algorithms presently take the document vectors as input. > Generating these document vectors from the text can be broken in two tasks. > 1. Create lucene index of the input plain-text documents > 2. From the index, generate the document vectors (sparse) with weights as > TF-IDF values of the term. With lucene index, this value can be calculated > very easily. > Presently, I have created two separate utilities, which could possibly be > invoked from another class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
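To make the "reuters" case concrete, here is the filter condition from the constructor quoted above, isolated as a tiny illustrative helper (the class name is hypothetical). With numDocs = 1000 and maxDfPercent = 99, the cutoff is 990, so a term that appears in every document (df = 1000) is silently dropped from the dictionary even though it still appears in each document's term vector, which is exactly where the null entries come from.

```java
// Hypothetical isolation of the df filter in CachedTermInfo: a term is kept
// only if minDf <= df <= numDocs * maxDfPercent / 100.
public class DfFilter {

  public static boolean keep(int df, int minDf, int maxDfPercent, int numDocs) {
    double maxDf = numDocs * maxDfPercent / 100.0;
    return df >= minDf && df <= maxDf;
  }
}
```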
[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-121: - Attachment: MAHOUT-121.patch Not sure if my very truly last version of the patch got posted. Here it is. It is relative to the root rather than trunk/ -- seems my hand editing doesn't work. > Speed up distance calculations for sparse vectors > - > > Key: MAHOUT-121 > URL: https://issues.apache.org/jira/browse/MAHOUT-121 > Project: Mahout > Issue Type: Improvement > Components: Matrix >Reporter: Shashikant Kore > Attachments: MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, > MAHOUT-121.patch, MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, > Mahout1211.patch > > > From my mail to the Mahout mailing list. > I am working on clustering a dataset which has thousands of sparse vectors. > The complete dataset has a few tens of thousands of feature items but each > vector has only a couple of hundred feature items. For this, there is an > optimization in distance calculation, a link to which I found in the archives > of the Mahout mailing list. > http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/ > I tried out this optimization. The test setup had 2000 document vectors > with a few hundred items. I ran canopy generation with Euclidean distance and > t1, t2 values as 250 and 200. > > Current Canopy Generation: 28 min 15 sec. > Canopy Generation with distance optimization: 1 min 38 sec. > I know by experience that using Integer, Double objects instead of primitives > is computationally expensive. I changed the sparse vector implementation to > use primitive collections from Trove [ > http://trove4j.sourceforge.net/ ]. > Distance optimization with Trove: 59 sec > Current canopy generation with Trove: 21 min 55 sec > To sum, these two optimizations reduced cluster generation time by 97%. > Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
> > Licensing of Trove seems to be an issue which needs to be addressed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
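The optimization the linked post describes is algebraic: expand the squared Euclidean distance as |a|^2 - 2*a.b + |b|^2 so that only the nonzero entries of each vector are ever touched, instead of all tens of thousands of dimensions. A minimal sketch of the idea, using plain boxed maps rather than Trove's primitive collections (the class and method names here are illustrative, not Mahout's actual API):

```java
import java.util.Map;

// Sketch of the sparse-distance trick from the linked LingPipe post
// (hypothetical class, not Mahout's actual SparseVector API):
// d(a,b)^2 = |a|^2 - 2*a.b + |b|^2, computed over nonzero entries only.
public class SparseEuclidean {

  // Squared norm, iterating nonzero entries only.
  static double normSquared(Map<Integer, Double> v) {
    double sum = 0.0;
    for (double x : v.values()) {
      sum += x * x;
    }
    return sum;
  }

  // Dot product: iterate the smaller map, probe the larger one.
  static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
    if (a.size() > b.size()) {
      return dot(b, a);
    }
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        sum += e.getValue() * other;
      }
    }
    return sum;
  }

  // In canopy/k-means the centers are reused many times, so |center|^2 can
  // be cached and each distance costs O(nnz) rather than O(dimensions).
  static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
    double d2 = normSquared(a) - 2.0 * dot(a, b) + normSquared(b);
    return Math.sqrt(Math.max(0.0, d2)); // guard against rounding below zero
  }
}
```

Replacing the `Map<Integer, Double>` with a Trove `TIntDoubleHashMap` removes the autoboxing overhead the comment measures, which is where the remaining 1 min 38 sec → 59 sec improvement comes from.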
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721215#action_12721215 ]

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Hey David, I'm not sure what's going on here, because that value being null means the term is not in the index, yet it is in the term vector for that doc. Are you sure you're loading the same field? Can you share the indexing code? The fix works, but I'd like to know at a deeper level what's going on.

> Prepare document vectors from the text
> --------------------------------------
>
>             Key: MAHOUT-126
>             URL: https://issues.apache.org/jira/browse/MAHOUT-126
>         Project: Mahout
>      Issue Type: New Feature
> Affects Versions: 0.2
>        Reporter: Shashikant Kore
>        Assignee: Grant Ingersoll
>        Fix For: 0.2
>
>     Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
> Clustering algorithms presently take the document vectors as input. Generating these document vectors from the text can be broken into two tasks:
> 1. Create a Lucene index of the input plain-text documents.
> 2. From the index, generate the (sparse) document vectors with weights as the TF-IDF values of the terms. With a Lucene index, this value can be calculated very easily.
> Presently, I have created two separate utilities, which could possibly be invoked from another class.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
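The weighting in step 2 of the issue description is plain TF-IDF over statistics a Lucene index supplies cheaply: the term's frequency in the document (from the term vector) and its document frequency in the corpus. A hedged sketch of one common variant, tf * log(N/df) -- the class and method names are illustrative, not the patch's actual API, and other IDF smoothings (e.g. adding 1 inside the log) exist:

```java
// Illustrative TF-IDF weight computation (hypothetical class, not the
// MAHOUT-126 patch's actual code). termFreq comes from the doc's term
// vector; docFreq and numDocs from IndexReader-level statistics.
public class TfIdf {
  static double weight(int termFreq, int docFreq, int numDocs) {
    if (termFreq == 0 || docFreq == 0) {
      return 0.0; // term absent from doc, or absent from index entirely
    }
    return termFreq * Math.log((double) numDocs / docFreq);
  }
}
```

Note the `docFreq == 0` guard: a term present in a document's term vector but missing from the index (the null-entry situation discussed in the comment above) would otherwise divide by zero.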
Re: [GSOC] Thoughts about Random forests map-reduce implementation
Ok then, I shall implement the easy map-reduce version and see how it behaves.

> Ultimately, I would think that it is also interesting to modify the
> original algorithm to build multiple trees for different portions of the
> data. That loses some of the solidity of the original method, but could
> actually do better if the splits exposed non-stationary behavior.

Very interesting, and it could make the map-reduce implementation capable of dealing with very large datasets. When you say "build multiple trees for different portions of the data", what's the difference with the basic bagging algorithm, which builds each tree using a different portion (about 2/3) of the data?

--- On Wed, 17 Jun 2009, Ted Dunning wrote:

> From: Ted Dunning
> Subject: Re: [GSOC] Thoughts about Random forests map-reduce implementation
> To: mahout-dev@lucene.apache.org
> Date: Wednesday, June 17, 2009, 21:10
>
> This is a classic problem of scaling a solution as the problem gets wide
> (number of trees) and tall (amount of data).
>
> The problem of building a random forest on a large data set with N trees is
> N times the cost on a single node (as you point out) and N is typically
> about the number of cores available in a hadoop cluster or a small multiple
> thereof. This means that your simple solution would give essentially
> perfect speed up if the data set still fits in memory. That means that a
> simple implementation is likely to be of some use.
>
> On the other hand, it sounds like your Information Gain computation has some
> real performance problems that probably should be addressed.
>
> Ultimately, I would think that it is also interesting to modify the original
> algorithm to build multiple trees for different portions of the data. That
> loses some of the solidity of the original method, but could actually do
> better if the splits exposed non-stationary behavior.
> On Wed, Jun 17, 2009 at 3:45 AM, deneche abdelhakim wrote:
>
> > As we talked about in the following discussion (A), I'm considering two
> > ways to implement a distributed map-reduce builder.
> >
> > Given the reference implementation, the easiest implementation is the
> > following:
> >
> > * the data is distributed to the slave nodes using the DistributedCache
> > * each mapper loads the data in memory in JobConfigurable.configure()
> > * each tree is built by one mapper
> > ...
> > * the main program builds the forest using DecisionTree.parse(String) for
> > each tree
> > ...
> > Cons:
> > * because it's based on the ref. implementation, it will be very slow when
> > dealing with large datasets
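The distinction deneche asks about can be made concrete. Bagging draws a bootstrap sample: n draws with replacement from the whole dataset, which covers on average 1 - 1/e (about 63%, the "about 2/3" in the question) of the distinct rows, scattered across the full data. The variant Ted describes instead hands each mapper one contiguous slice, which is small and local rather than large and random. A sketch of the two schemes, over row indices (illustrative only, not Mahout's implementation):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Contrast of the two data-partitioning schemes discussed in the thread.
public class Partitioning {

  // Bagging: n draws with replacement over rows [0, n). On average about
  // 1 - 1/e (~63%) of distinct rows appear in each bootstrap sample.
  static List<Integer> bootstrapSample(int n, Random rng) {
    List<Integer> sample = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      sample.add(rng.nextInt(n));
    }
    return sample;
  }

  // Map-reduce split: mapper m of k sees only one contiguous block of rows,
  // i.e. n/k of the data -- small and contiguous rather than large and random.
  static List<Integer> contiguousSplit(int n, int k, int m) {
    List<Integer> split = new ArrayList<>();
    for (int i = m * n / k; i < (m + 1) * n / k; i++) {
      split.add(i);
    }
    return split;
  }

  // Fraction of distinct rows of the full dataset that a sample touches.
  static double distinctFraction(List<Integer> rows, int n) {
    Set<Integer> distinct = new HashSet<>(rows);
    return distinct.size() / (double) n;
  }
}
```

With k mappers, each contiguous split covers only 1/k of the rows, which is why trees built this way differ from bagged trees and why, as Ted notes, the scheme could help or hurt depending on whether the splits expose non-stationary behavior.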