Speed up Frequent Compile
When developing the Mahout core/utils/examples modules we don't need to regenerate math often and don't need to tar/gzip/bzip2 the jar files. We are mostly concerned with the job file / jar file. Can't there be another target, like "develop", which does just this? (Waiting 2-3 mins for a 2-line change is frustrating.) Robin
[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: -- Attachment: MAHOUT-237-tfidf.patch
4 main entry points:
DocumentProcessor - does SequenceFile<Text,Text> => StringTuple (later replaced by StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - StringTuple of documents => tf Vector
PartialVectorMerger - merges partial vectors based on their doc id. Does optional normalizing (used by both DictionaryVectorizer (no normalizing) and TFIDFConverter (optional normalizing))
TfidfConverter - converts a tf vector to a tfidf vector with optional normalizing
An example which uses all of them:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o reuters-vectors -w (tfidf|tf) --norm 2 (--norm works only when tfidf is enabled, not with tf)
Map/Reduce Implementation of Document Vectorizer Key: MAHOUT-237 URL: https://issues.apache.org/jira/browse/MAHOUT-237 Project: Mahout Issue Type: New Feature Affects Versions: 0.3 Reporter: Robin Anil Assignee: Robin Anil Fix For: 0.3 Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch, SparseVector-VIntWritable.patch
The current Vectorizer uses a Lucene Index to convert documents into SparseVectors. Ted is working on a hash-based Vectorizer which can map features into Vectors of fixed size and sum them up to get the document Vector. This is a pure bag-of-words based Vectorizer written in Map/Reduce. The input documents are in a SequenceFile<Text,Text>, with key = docid, value = content.
First: Map/Reduce over the document collection and generate the feature counts.
Second: a sequential pass reads the output of the Map/Reduce and converts it to SequenceFile<Text,LongWritable> where key = feature, value = unique id. This second stage should create shards of features of a given split size.
Third: Map/Reduce over the document collection, using each shard, and create partial SparseVectors (containing the features of the given shard).
Fourth: Map/Reduce over the partial shards, grouping by docid, to create the full document Vector.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Mahout 0.3 Plan and other changes
I am committing the first level of changes so that Drew can work on it. I have updated the patch on the issue as a reference. Ted, please take a look when you get time. The names will change correspondingly. What I have right now is 4 main entry points:
DocumentProcessor - does SequenceFile<Text,Text> => StringTuple (later replaced by StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - StringTuple of documents => tf Vector
PartialVectorMerger - merges partial vectors based on their doc id. Does optional normalizing (used by both DictionaryVectorizer (no normalizing) and TFIDFConverter (optional normalizing))
TfidfConverter - converts a tf vector to a tfidf vector with optional normalizing
An example which uses all of them:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o reuters-vectors -w (tfidf|tf) --norm 2 (works only with tfidf for now)
Robin
On Fri, Feb 5, 2010 at 12:46 PM, Ted Dunning ted.dunn...@gmail.com wrote: Drew has an early code drop that should be posted shortly. He has a generic AvroWritable that can serialize anything with an appropriate schema. That changes your names and philosophy a bit. Regarding n-grams, I think that will be best combined with a non-dictionary based vectorizer because of the large implied vocabulary that would otherwise result. Also, in many cases vectorization and n-gram generation is best done in the learning algorithm itself to avoid moving massive amounts of data. As such, vectorization will probably need to be a library rather than a map-reduce program.
On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil robin.a...@gmail.com wrote: Let's break it down into milestones. See if you agree on the following (even the class names?)
On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning ted.dunn...@gmail.com wrote: These are good questions. I see the best course as answering these kinds of questions in phases. First, the only thing that is working right now is the current text => vector stuff. We should continue to refine this with alternative forms of vectorization (random indexing, stochastic projection as well as the current dictionary approach).
The input to all these vectorization jobs is the StructuredDocumentWritable format, which you and Drew will work on (Avro based). To create the StructuredDocumentWritable format we have to write MapReduces which will convert:
a) SequenceFile => single-field token array using an Analyzer. I am going with simple Document => StructuredDocumentWritable (encapsulating StringTuple) in M1; change it to the full StructuredDocumentWritable in M2
b) Lucene repo => StructuredDocumentWritable M2
c) Structured XML => StructuredDocumentWritable M2
d) Other formats/data sources (RDBMS) => StructuredDocumentWritable M3
Jobs using StructuredDocumentWritable:
a) DictionaryVectorizer - makes VectorWritable M1
b) nGram Generator - makes ngrams - 1) appends to the dictionary - creates partial vectors - merges with vectors from the DictionaryVectorizer to create ngram-based vectors M1 2) appends to other vectorizers (random indexing, stochastic) M1? or M2
c) Random Indexing Job - makes VectorWritable M1? or M2
d) Stochastic Projection Job - makes VectorWritable M1? or M2
How does this sound? Feel free to edit/reorder them.
A second step is to be able to store and represent more general documents similar to what is possible with Lucene.
This is critically important for some of the things that I want to do where I need to store and segregate title, publisher, authors, abstracts and body text (and many other characteristics ... we probably have 100 of them). It is also critically important if we want to embrace the dualism between recommendation and search. Representing documents can be done without discarding the simpler approach we have now and it can be done in advance of good vectorization of these complex documents. A third step is to define advanced vectorization for complex documents. As an interim step, we can simply vectorize using the dictionary and alternative vectorizers that we have now, but applied to a single field of the document. Shortly, though, we should be able to define cross occurrence features for a multi-field vectorization. The only dependencies here are that the third step depends on the first and second. You have been working on the Dictionary vectorizer. I did a bit of work on stochastic projection with some cooccurrence. In parallel Drew and I have been working on building an Avro document schema. This is driving forward on step 2. I think that will actually bear some fruit quickly. Once that is done, we should merge capabilities. I am
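To make the text => vector pipeline described above concrete, here is a small, self-contained in-memory analogue in Java. The real MAHOUT-237 code runs each stage as a map/reduce over SequenceFiles; this sketch only shows the math of the dictionary approach (tokenize => dictionary ids => tf vectors => tfidf with optional L2 norm, as in the --norm 2 example). The (log idf + 1) weighting is one common variant, assumed here for illustration, not necessarily the exact formula in the patch.
{code}
import java.util.*;

// In-memory analogue of the dictionary-vectorizer pipeline, for illustration only.
public class TinyVectorizer {
  public static void main(String[] args) {
    Map<String, String> docs = new LinkedHashMap<>();
    docs.put("doc1", "wheat corn wheat");
    docs.put("doc2", "corn trade");

    // Stages 1-2: tokenize and assign each feature a unique dictionary id.
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    Map<String, Map<Integer, Double>> tf = new LinkedHashMap<>();
    Map<Integer, Integer> df = new HashMap<>(); // document frequency per feature
    for (Map.Entry<String, String> doc : docs.entrySet()) {
      Map<Integer, Double> vector = new HashMap<>();
      for (String token : doc.getValue().split("\\s+")) {
        int id = dictionary.computeIfAbsent(token, t -> dictionary.size());
        vector.merge(id, 1.0, Double::sum); // raw term frequency
      }
      for (int id : vector.keySet()) df.merge(id, 1, Integer::sum);
      tf.put(doc.getKey(), vector);
    }

    // Stages 3-4: tf -> tfidf, then L2 normalize (the --norm 2 case).
    int numDocs = docs.size();
    for (Map.Entry<String, Map<Integer, Double>> e : tf.entrySet()) {
      Map<Integer, Double> v = e.getValue();
      v.replaceAll((id, f) -> f * (Math.log((double) numDocs / df.get(id)) + 1.0));
      double norm = Math.sqrt(v.values().stream().mapToDouble(x -> x * x).sum());
      if (norm > 0) v.replaceAll((id, x) -> x / norm);
      System.out.println(e.getKey() + " -> " + v);
    }
  }
}
{code}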
[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: -- Status: Patch Available (was: Reopened) Working implementation of the DictionaryVectorizer with tf/tfidf weighting and normalization. [snip: issue summary and description, quoted in the first MAHOUT-237 message above] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: -- Resolution: Fixed Status: Resolved (was: Patch Available) [snip: issue summary and description, quoted in the first MAHOUT-237 message above] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-220. --- Resolution: Fixed Committed. Mahout Bayes Code cleanup - Key: MAHOUT-220 URL: https://issues.apache.org/jira/browse/MAHOUT-220 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.3 Reporter: Robin Anil Assignee: Robin Anil Fix For: 0.3 Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch Following Isabel's checkstyle, I am adding a whole slew of code cleanups, with the following exceptions: 1. The line length used is 120 instead of 80. 2. The static final log is kept as-is, not renamed to LOG. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-221) Implementation of FP-Bonsai Pruning for fast pattern mining
[ https://issues.apache.org/jira/browse/MAHOUT-221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-221. --- Resolution: Fixed Committed Implementation of FP-Bonsai Pruning for fast pattern mining --- Key: MAHOUT-221 URL: https://issues.apache.org/jira/browse/MAHOUT-221 Project: Mahout Issue Type: New Feature Components: Frequent Itemset/Association Rule Mining Affects Versions: 0.2 Reporter: Robin Anil Assignee: Robin Anil Fix For: 0.3 Attachments: MAHOUT-FPGROWTH.patch, MAHOUT-FPGROWTH.patch
FP-Bonsai is a method to prune long-chained FP-Trees for faster growth: http://win.ua.ac.be/~adrem/bibrem/pubs/fpbonsai.pdf
This implementation also adds a transaction-preprocessing map/reduce job which converts a list of transactions {1, 2, 4, 5}, {1, 2, 3}, {1, 2} into a tree structure, and thus saves space during the FPGrowth map/reduce. The tree formed from the transactions above is (1,3) -> (2,3), which then branches into (4,1) -> (5,1) and (3,1). For typical datasets this improves the storage space by a great amount and thus saves time during shuffle and sort.
Also added a reducer to PFPGrowth (not part of the original paper) which does this compression and saves on space. This patch also adds an example transaction dataset generator from the flickr and delicious data sets: https://www.uni-koblenz.de/FB4/Institutes/IFI/AGStaab/Research/DataSets/PINTSExperimentsDataSets/ Both of them are gigabytes of tag data, where date, userid, itemid, tag tuples are given. The example maker creates a transaction based on all the unique tags a user has tagged on an item.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
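For readers unfamiliar with the compression step described above, here is a minimal in-memory sketch (not the MAHOUT-221 code) of folding transactions into a counted prefix tree; running it on the three transactions from the issue reproduces the (1,3) -> (2,3) -> {(4,1) -> (5,1), (3,1)} shape:
{code}
import java.util.*;

// Minimal counted prefix tree: transactions sharing a prefix share one path.
class FPNode {
  final int item;
  int count;
  final Map<Integer, FPNode> children = new HashMap<>();
  FPNode(int item) { this.item = item; }

  // Insert one (already ordered) transaction, bumping counts along the path.
  void insert(int[] txn, int pos) {
    if (pos == txn.length) return;
    FPNode child = children.computeIfAbsent(txn[pos], FPNode::new);
    child.count++;
    child.insert(txn, pos + 1);
  }

  static void print(FPNode n, String indent) {
    for (FPNode c : n.children.values()) {
      System.out.println(indent + "(" + c.item + "," + c.count + ")");
      print(c, indent + "  ");
    }
  }

  public static void main(String[] args) {
    FPNode root = new FPNode(-1); // dummy root
    int[][] txns = {{1, 2, 4, 5}, {1, 2, 3}, {1, 2}};
    for (int[] t : txns) root.insert(t, 0);
    print(root, ""); // prints (1,3), (2,3), then the (4,1)-(5,1) and (3,1) branches
  }
}
{code}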
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830056#action_12830056 ] Robin Anil commented on MAHOUT-153: --- Any progress on this? Will it be ready soon, or should it be pushed to the 0.4 release? Implement kmeans++ for initial cluster selection in kmeans -- Key: MAHOUT-153 URL: https://issues.apache.org/jira/browse/MAHOUT-153 Project: Mahout Issue Type: New Feature Components: Clustering Affects Versions: 0.2 Environment: OS Independent Reporter: Panagiotis Papadimitriou Fix For: 0.3 Original Estimate: 336h Remaining Estimate: 336h The current implementation of k-means includes the following algorithms for initial cluster selection (seed selection): 1) random selection of k points, 2) use of canopy clusters. I plan to implement k-means++. The details of the algorithm are available here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf. Design Outline: I will create an abstract class SeedGenerator and a subclass KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will become a subclass of SeedGenerator. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
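For context, the seeding rule from the cited paper is simple to state: pick the first seed at random, then draw each subsequent seed with probability proportional to its squared distance from the nearest already-chosen seed. A self-contained sketch of that rule follows; it is illustrative only, not the proposed SeedGenerator/KMeansPlusPlusSeedGenerator API:
{code}
import java.util.Random;

// Sketch of k-means++ (D^2) seeding over an in-memory point set.
class KMeansPlusPlusSketch {
  static double[][] chooseSeeds(double[][] points, int k, Random rng) {
    double[][] seeds = new double[k][];
    seeds[0] = points[rng.nextInt(points.length)];
    double[] d2 = new double[points.length];
    for (int s = 1; s < k; s++) {
      double total = 0;
      for (int i = 0; i < points.length; i++) {
        d2[i] = Double.MAX_VALUE; // squared distance to nearest chosen seed
        for (int j = 0; j < s; j++) {
          d2[i] = Math.min(d2[i], sqDist(points[i], seeds[j]));
        }
        total += d2[i];
      }
      double r = rng.nextDouble() * total; // sample proportional to d^2
      int idx = 0;
      for (double acc = d2[0]; acc < r && idx < points.length - 1; acc += d2[++idx]) { }
      seeds[s] = points[idx];
    }
    return seeds;
  }

  static double sqDist(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
    return sum;
  }
}
{code}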
Re: Release thinking
Reviving this thread. Copy-pasting the whole thing as we move forward.
Current Snapshot (Key - Summary - Status):
MAHOUT-221 Implementation of FP-Bonsai Pruning for fast pattern mining - Done
MAHOUT-227 Parallel SVM - In Progress
MAHOUT-240 Parallel version of Perceptron - Little Progress
MAHOUT-241 Example for perceptron - Little Progress
MAHOUT-185 Add mahout shell script for easy launching of various algorithms - In Progress
MAHOUT-153 Implement kmeans++ for initial cluster selection in kmeans - Little Progress (there is discussion, but no patch yet)
MAHOUT-232 Implementation of sequential SVM solver based on Pegasos - In Progress
MAHOUT-228 Need sequential logistic regression implementation using SGD techniques - In Progress
MAHOUT-263 Matrix interface should extend Iterable<Vector> for better integration with distributed storage - Done
MAHOUT-237 Map/Reduce Implementation of Document Vectorizer - Done
MAHOUT-220 Mahout Bayes Code cleanup - Done
MAHOUT-265 Error with creating MVC from Lucene Index or Arff - Done
MAHOUT-215 Provide jars with mahout release - Done
MAHOUT-209 Add aggregate() methods for Vector - Done
MAHOUT-231 Upgrade QM reports to use Clover 2.6 - Little Progress (not really required in the release; a developer thing)
MAHOUT-106 PLSI/EM in pig based on Hofmann's ACM '04 paper - In Progress
MAHOUT-155 ARFF VectorIterable - Little Progress
MAHOUT-214 Implement Stacked RBM - Little Progress
[jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830077#action_12830077 ] Robin Anil commented on MAHOUT-185: --- I like the script as I am running k-means these days :)
{code}
if [ "$COMMAND" = "vectordump" ] ; then
  CLASS=org.apache.mahout.utils.vectors.VectorDumper
elif [ "$COMMAND" = "clusterdump" ] ; then
  CLASS=org.apache.mahout.utils.clustering.ClusterDumper
elif [ "$COMMAND" = "seqdump" ] ; then
  CLASS=org.apache.mahout.utils.SequenceFileDumper
elif [ "$COMMAND" = "kmeans" ] ; then
  CLASS=org.apache.mahout.clustering.kmeans.KMeansDriver
elif [ "$COMMAND" = "canopy" ] ; then
  CLASS=org.apache.mahout.clustering.canopy.CanopyDriver
elif [ "$COMMAND" = "lucenevector" ] ; then
  CLASS=org.apache.mahout.utils.vectors.lucene.Driver
elif [ "$COMMAND" = "seqdirectory" ] ; then
  CLASS=org.apache.mahout.text.SequenceFilesFromDirectory
elif [ "$COMMAND" = "seqwiki" ] ; then
  CLASS=org.apache.mahout.text.WikipediaToSequenceFile
fi
{code}
If we go like this we might have too many options. Any way to streamline this? One thought I have is to have package-level Main classes in Core, like org.apache.mahout.Clustering.java, which internally calls the different main functions. Similarly in examples and util we can keep one entry class each, Examples.java and Util.java. With this limited set we can keep a global conf object which implements Tool, and the fs object which is the default filesystem as specified by the conf. This way each algorithm can request a conf object (which copies everything Tool has set). How does that sound? I can whip up all the main classes tonight.
Add mahout shell script for easy launching of various algorithms Key: MAHOUT-185 URL: https://issues.apache.org/jira/browse/MAHOUT-185 Project: Mahout Issue Type: New Feature Affects Versions: 0.2 Environment: linux, bash Reporter: Robin Anil Fix For: 0.3 Attachments: MAHOUT-185.patch
Currently, each algorithm has a different point of entry. As it is, it's too complicated to understand and launch each one. A mahout shell script needs to be made in the bin directory which does something like the following:
mahout classify -algorithm bayes [OPTIONS]
mahout cluster -algorithm canopy [OPTIONS]
mahout fpm -algorithm pfpgrowth [OPTIONS]
mahout taste -algorithm slopeone [OPTIONS]
mahout misc -algorithm createVectorsFromText [OPTIONS]
mahout examples WikipediaExample
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
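Robin's package-level entry-class idea can be sketched in Java as follows. This is illustrative only: the driver class names in the map are the ones from the script above, but the dispatcher itself (name, mechanism) is a hypothetical sketch, not a committed design:
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical single entry point for the clustering package: maps a command
// word to a driver class and forwards the remaining args to its main().
public class Clustering {
  private static final Map<String, String> DRIVERS = new HashMap<>();
  static {
    DRIVERS.put("kmeans", "org.apache.mahout.clustering.kmeans.KMeansDriver");
    DRIVERS.put("canopy", "org.apache.mahout.clustering.canopy.CanopyDriver");
  }

  public static void main(String[] args) throws Exception {
    String className = DRIVERS.get(args[0]);
    if (className == null) {
      System.err.println("Unknown command: " + args[0] + "; expected one of " + DRIVERS.keySet());
      return;
    }
    String[] rest = new String[args.length - 1];
    System.arraycopy(args, 1, rest, 0, rest.length);
    // Invoke the chosen driver's own main() reflectively.
    Class.forName(className)
        .getMethod("main", String[].class)
        .invoke(null, (Object) rest);
  }
}
{code}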
Re: Proposing a C++ Port for Apache Mahout
One thought on these lines is that we should start the process to be a TLP; then we could have a subproject explicitly dedicated to C++ (or any other language) and there wouldn't necessarily need to be a 1-1 port. -Grant On Feb 5, 2010, at 12:56 AM, Kay Kay wrote: If there were an effort to write in C++, it would definitely be useful, and to exploit the maximum advantages, porting would be more beneficial over time compared to a wrapper, even if it were to apply to a subset of the algorithms supported by Mahout. A wrapper would serve the syntactic purpose, but when it comes to profiling / performance extraction it would be a huge distraction. But, as has been pointed out earlier, the algorithms depend on the M-R framework very much, and hence the success of this effort would also be tied to the Hadoop C/C++ port's maturity as well. Something worth noting before venturing along these lines. On 02/04/2010 09:22 AM, Atul Kulkarni wrote: Hey guys, My 1 cent... I would be really happy to contribute to this task of enabling use of Mahout via C++ (wrapper / port, either way). I have some experience with C++ and have been wanting to use Mahout via C++ (as that is my comfort zone compared to Java). I think a port will put the code directly into the hands of the C++ developers, which sounds really exciting to me as a C++ developer. But I also understand the concern of maintaining two different code bases for the same task, and hence also like the idea of writing wrappers. So I am divided on the two options; either works for me. Regards, Atul. On Thu, Feb 4, 2010 at 10:54 AM, Robin Anil robin.a...@gmail.com wrote: Hi Israel. I think it's a wonderful idea to have ports of Mahout; it tells us that we have a great platform that people really want to use. The only concern is that Hadoop is still in Java and they are not going with C++. They work around it by using native libraries to execute cpu-intensive tasks like sorting and compressing, the reason being that Java is much easier to manage in such a distributed system (I guess a lot of people may differ in opinion). Regardless, I guess wrappers could be made to ease execution of Mahout algorithms from any language. If that's a solution you like then folks here can concentrate on improving just one code base. Robin On Thu, Feb 4, 2010 at 10:08 PM, Israel Ekpo israele...@gmail.com wrote: Hey guys, First of all I would like to start by thanking all the commiters and contributors for all their hard work so far on this project. Most importantly, I want to thank the Apache Mahout community for bringing this very promising project to where it is now. It's pretty amazing to see what the project has accomplished in a short span of 2 years. I strongly believe that Apache Mahout is really going to change things around for the data mining and machine learning community, the same way Apache Lucene and Apache Solr are taking over this sector as we speak. Currently Apache Mahout is only available in Java, and there are a lot of tools in Mahout that are very useful, and a lot of people (students, instructors, researchers and computer scientists) are using them daily. I think it would be nice if all of these tools in Mahout were also available in C++, so that users that already have systems written in C++ can plug in and integrate Mahout a lot easier with their existing or planned C++ systems.
If we have the C++ port up and running, possibly more members of the data mining and machine learning community could get involved, and ideas could be shuffled in both directions (Java and the C++ port). I will volunteer to spearhead this porting effort to get things started. I am sending this message to all members of the Apache Mahout community to hear what you think should be done to get this porting effort up and running. Thanks in advance for your constructive and anticipated responses. Sincerely, Israel Ekpo -- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
Re: Release thinking
I just marked the 0.1 and 0.2 releases as released (about time). This makes the JIRA road map feature more usable. See here for the live version of this summary: https://issues.apache.org/jira/browse/MAHOUT?report=com.atlassian.jira.plugin.system.project:roadmap-panel On Fri, Feb 5, 2010 at 3:16 AM, Robin Anil robin.a...@gmail.com wrote: [snip: status table quoted in full above] -- Ted Dunning, CTO DeepDyve
Re: [jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms
Surely there is a clever way to use annotations for this. Not that I know what it might be. On Fri, Feb 5, 2010 at 4:05 AM, Robin Anil (JIRA) j...@apache.org wrote: If we go like this we might have too many options. Any way to streamline this ? One thought i have is to have package level Main classes in Core like org.apache.mahout.Clustering.java which internally calls the different main functions ? -- Ted Dunning, CTO DeepDyve
Re: Release thinking
Yum Yum. 0.1: 59 issues, 0.2: 66 issues, 0.3: 91 issues - 13 left. On Fri, Feb 5, 2010 at 9:47 PM, Ted Dunning ted.dunn...@gmail.com wrote: I just marked the 0.1 and 0.2 releases as released (about time). This makes the JIRA road map feature more usable. See here for the live version of this summary: https://issues.apache.org/jira/browse/MAHOUT?report=com.atlassian.jira.plugin.system.project:roadmap-panel On Fri, Feb 5, 2010 at 3:16 AM, Robin Anil robin.a...@gmail.com wrote: [snip: status table quoted in full above] -- Ted Dunning, CTO DeepDyve
[jira] Created: (MAHOUT-274) Use avro for serialization of structured documents.
Use avro for serialization of structured documents. --- Key: MAHOUT-274 URL: https://issues.apache.org/jira/browse/MAHOUT-274 Project: Mahout Issue Type: Improvement Reporter: Drew Farris Priority: Minor Explore the intersection between Writables and Avro to see how serialization can be improved within Mahout. An intermediate goal is to provide a structured document format that can be serialized using Avro as an Input/OutputFormat and Writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.
[ https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-274: --- Attachment: mahout-avro-examples.tar.gz
A very rudimentary exploration of using avro to produce writables. Uses the avro specific java class generation facility to produce a structured document class which is wrapped in a generic writable container for serialization.
* classes in o.a.m.avro are produced from the schema in src/main/schemata/o../a../m../avro/AvroDocument.avsc using o.a.m.avro.util.AvroDocumentCompiler
* provides a generic avro Writable implementation in o.a.m.avro.mapred.SpecificAvroWritable
* see the test in src/test/java, o.a.m.avro.mapred.SpecificAvroWritableTest, to see how this can be used
'mvn clean install' will run the whole shebang.
Use avro for serialization of structured documents. --- Key: MAHOUT-274 URL: https://issues.apache.org/jira/browse/MAHOUT-274 Project: Mahout Issue Type: Improvement Reporter: Drew Farris Priority: Minor Attachments: mahout-avro-examples.tar.gz [snip: issue description quoted above] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
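The tarball itself isn't reproduced here, but the core trick it explores, wrapping an Avro-generated specific record in a Hadoop Writable, can be sketched roughly as follows. This is an assumption-laden illustration, not Drew's actual SpecificAvroWritable: it assumes a recent Avro (the EncoderFactory/DecoderFactory APIs below postdate this thread), and AvroDocument stands in for the class generated from AvroDocument.avsc:
{code}
import java.io.*;
import org.apache.avro.io.*;
import org.apache.avro.specific.*;
import org.apache.hadoop.io.Writable;

// Illustrative Writable wrapper around an Avro specific record. The record is
// length-prefixed Avro binary so it round-trips cleanly through Hadoop I/O.
public class AvroDocumentWritable implements Writable {
  private AvroDocument document; // assumed to be generated from AvroDocument.avsc

  public AvroDocument get() { return document; }
  public void set(AvroDocument document) { this.document = document; }

  @Override
  public void write(DataOutput out) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(buf, null);
    new SpecificDatumWriter<AvroDocument>(AvroDocument.class).write(document, encoder);
    encoder.flush();
    byte[] bytes = buf.toByteArray();
    out.writeInt(bytes.length); // length prefix
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    document = new SpecificDatumReader<AvroDocument>(AvroDocument.class).read(null, decoder);
  }
}
{code}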
Re: Release thinking
On Fri, Feb 5, 2010 at 11:17 AM, Ted Dunning ted.dunn...@gmail.com wrote: I just marked the 0.1 and 0.2 releases as released (about time). This makes the JIRA road map feature more usable. See here for the live version of this summary: https://issues.apache.org/jira/browse/MAHOUT?report=com.atlassian.jira.plugin.system.project:roadmap-panel Very nice, thanks Ted.
Re: Speed up Frequent Compile
On Fri, Feb 5, 2010 at 3:27 AM, Robin Anil robin.a...@gmail.com wrote: When developing the Mahout core/utils/examples modules we don't need to regenerate math often and don't need to tar/gzip/bzip2 the jar files. We are mostly concerned with the job file / jar file. Can't there be another target, like "develop", which does just this? (Waiting 2-3 mins for a 2-line change is frustrating.) Indeed. Robin, how are you doing your builds? I could have sworn I eliminated the building of tar, gzip, bzip2 files unless the -Prelease flag is specified.
Re: Speed up Frequent Compile
I usually do an initial compilation using mvn package. Then, during development I use IntelliJ's incremental compilation, which generally only takes a few seconds. Since that compilation doesn't handle things like copying resources, I get caught out and surprised now and again, but this works almost all the time. On Fri, Feb 5, 2010 at 12:27 AM, Robin Anil robin.a...@gmail.com wrote: [snip: original request quoted above] -- Ted Dunning, CTO DeepDyve
Re: Release thinking
Makes a lot of sense. Drew? On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix jake.man...@gmail.com wrote: So are we really planning on all this structured document stuff and Avro for 0.3? Can we just try and finish up what was already scoped for 0.3, and have a quick turnaround for getting the things which have only really been started on in the past week or so into 0.4 sometime next month? -- Ted Dunning, CTO DeepDyve
Re: Release thinking
On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix jake.man...@gmail.com wrote: So are we really planning on all this structured document stuff and Avro for 0.3? Can we just try and finish up what was already scoped for 0.3, and have a quick turnaround for getting the things which have only really been started on in the past week or so into 0.4 sometime next month? Which is not to say that we shouldn't continue work on them; let's keep the patches going and up to date, let's just not worry about holding up 0.3 until they're fully tested and checked in. -jake
Re: Release thinking
Sounds great to me. On Fri, Feb 5, 2010 at 11:50 AM, Ted Dunning ted.dunn...@gmail.com wrote: Makes a lot of sense. Drew? On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix jake.man...@gmail.com wrote: [snip: quoted above] -- Ted Dunning, CTO DeepDyve
Re: Release thinking
On Fri, Feb 5, 2010 at 11:53 AM, Jake Mannix jake.man...@gmail.com wrote: Which is not to say that we shouldn't continue work on them, let's keep the patches going and up to date, let's just not worry about holding up 0.3 until they're fully tested and checked in. Yes absolutely. I'm also interested in hearing Robin's thoughts on how far the current document vectorizer, n-gram work should go for 0.3 Drew
Re: Speed up Frequent Compile
I do mvn install to generate the job: around 2-3 mins, since it generates the bz2/zip/gz. Otherwise mvn compile (15 secs out of 33 are in compiling math). On Fri, Feb 5, 2010 at 10:18 PM, Drew Farris drew.far...@gmail.com wrote: On Fri, Feb 5, 2010 at 3:27 AM, Robin Anil robin.a...@gmail.com wrote: [snip: original request quoted above] Indeed. Robin, how are you doing your builds? I could have sworn I eliminated the building of tar, gzip, bzip2 files unless the -Prelease flag is specified.
Re: Speed up Frequent Compile
Yes, for editing I use Eclipse in the same fashion. If I want to try out a job and see how it performs on Hadoop, I need the job compiled fast. On another note: I think there will be a lot of dead code in the job (with all the jar files bundled). Is there an optimizer for that, i.e. to remove classes which Mahout never uses, even indirectly? I see that loading the jar takes 10-20 seconds when initializing a mapper or reducer. It doesn't affect long-running jobs, but a 20 sec overhead for processing a 64MB chunk sucks. On Fri, Feb 5, 2010 at 10:19 PM, Ted Dunning ted.dunn...@gmail.com wrote: [snip: Ted's message and the original request, quoted above]
Re: Release thinking
I just updated it here: http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html Let's rename/refactor the classes and get the basic avro thing in for 0.3, so that people who use it get a smooth upgrade to 0.4. Robin On Fri, Feb 5, 2010 at 10:32 PM, Drew Farris drew.far...@gmail.com wrote: On Fri, Feb 5, 2010 at 11:53 AM, Jake Mannix jake.man...@gmail.com wrote: Which is not to say that we shouldn't continue work on them; let's keep the patches going and up to date, let's just not worry about holding up 0.3 until they're fully tested and checked in. Yes absolutely. I'm also interested in hearing Robin's thoughts on how far the current document vectorizer, n-gram work should go for 0.3 Drew
[jira] Updated: (MAHOUT-272) Add licenses for 3rd party jars to mahout binary release and remove additional unused dependencies.
[ https://issues.apache.org/jira/browse/MAHOUT-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-272: - Resolution: Fixed Assignee: Drew Farris Status: Resolved (was: Patch Available) Add licenses for 3rd party jars to mahout binary release and remove additional unused dependencies. --- Key: MAHOUT-272 URL: https://issues.apache.org/jira/browse/MAHOUT-272 Project: Mahout Issue Type: Bug Affects Versions: 0.3 Reporter: Drew Farris Assignee: Drew Farris Fix For: 0.3 Attachments: MAHOUT-272.patch The binary release produced by MAHOUT-215 includes some 3rd party jars that require licenses and other 3rd party jars (xpp3 + xstream) that are not required at all (eclipse core, a transitive dependency of hadoop, jfreechart a transitive dependency of watchmaker-swing). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Speed up Frequent Compile
So, I'm running: mvn -o install -DskipTests=true at the project root (in mahout). Comment out or remove the maven-assembly-plugin definition in core/pom.xml -- it reduced my core build time from 26s to 6s -- I can submit a patch for this. Mahout math is still 17s here due to code generation. I'm wondering if there's a way to modify the generation plugin so that it doesn't re-generate if there are no changes to the templates. You can remove the plugin definition from math/pom.xml and it doesn't seem to break anything unless you're doing a clean; that brings math compilation down to 3s. Total compile time is 22s. re: the job, I'll have to look into that further later. On Fri, Feb 5, 2010 at 12:06 PM, Robin Anil robin.a...@gmail.com wrote: [snip: earlier messages in this thread, quoted above]
Re: Proposing a C++ Port for Apache Mahout
Thanks everyone for your responses so far. The Apache Hadoop dependency was something I thought about initially, but I still went ahead and asked the question anyway. At this time, it would be a better use of resources and time to come up with a wrapper or an HTTP server/client setup of some sort. My reasoning behind this is the Hadoop dependency and the volatile nature of the API, as pointed out by Sean and Robin. Thanks again for all your responses. On Thu, Feb 4, 2010 at 12:22 PM, Atul Kulkarni atulskulka...@gmail.com wrote: [snip: earlier thread, quoted in full above] -- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
Re: Speed up Frequent Compile
Yes, the codegen could drop a timestamp file. It's a fair amount of work, and if we're killing this code for HPCC I'm dubious. If I could make the split work I could do this next. On Fri, Feb 5, 2010 at 12:19 PM, Drew Farris drew.far...@gmail.com wrote: [snip: Drew's build notes and the rest of the thread, quoted above]
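For reference, the timestamp idea amounts to an up-to-date check before generating anything. A minimal sketch, with placeholder file names (Mahout's actual template layout may differ):
{code}
import java.io.File;

// Minimal staleness check a code-generation step could run before doing any
// work: regenerate only if the template is newer than its generated output.
public class CodegenUpToDateCheck {
  static boolean isStale(File template, File generated) {
    return !generated.exists()
        || generated.lastModified() < template.lastModified();
  }

  public static void main(String[] args) {
    // Placeholder paths, for illustration only.
    File template = new File("math/src/main/java-templates/SparseVector.java.t");
    File generated = new File("math/target/generated-sources/SparseVector.java");
    if (isStale(template, generated)) {
      System.out.println("regenerating " + generated);
      // ... run the generator ...
    } else {
      System.out.println("up to date, skipping codegen");
    }
  }
}
{code}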
Re: Proposing a C++ Port for Apache Mahout
Grant, would the TLP be Mahout or under a different name? I also like the idea that it does not necessarily have to be a 1:1 port. Kay Kay, I changed my mind (about going the wrapper route); I think it would be nice to explore the possibilities with just a subset of the algorithms. That would be a good place to start. I will be in touch. On Feb 5, 2010, at 03:23 PM, Grant Ingersoll wrote: One thought on these lines is that we should start the process to be a TLP; then we could have a subproject explicitly dedicated to C++ (or any other language) and there wouldn't necessarily need to be a 1-1 port. -Grant [snip: earlier thread, quoted in full above]
Re: Speed up Frequent Compile
It's just meant to be a dev-only hack :) On Sat, Feb 6, 2010 at 3:09 AM, Benson Margulies bimargul...@gmail.com wrote: Yes, the codegen could drop a timestamp file. It's a fair amount of work, and if we're killing this code for HPCC I'm dubious. If I could make the split work I could do this next. On Fri, Feb 5, 2010 at 12:19 PM, Drew Farris drew.far...@gmail.com wrote: [snip: rest of thread quoted above]
Re: Speed up Frequent Compile
Then we could make a profile that turns off the codegen and turns on the build-helper to add the generated source dir instead. On Fri, Feb 5, 2010 at 4:49 PM, Robin Anil robin.a...@gmail.com wrote: It's just meant to be a dev-only hack :) [snip: rest of thread quoted above]
Re: [Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]
Jeff Eastman wrote: Jeff Eastman wrote: Jeff Eastman wrote: Ted Dunning wrote: This could also be caused if the prior is very diffuse. This makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha. Could you elaborate more on the function of alpha in the algorithm? Now I can answer my own question. Alpha_0 determines the probability a point will go into an empty cluster (ok, almost Ted's exact words). During the first iteration, the total counts of all prior clusters are zero. Thus the Beta calculation that drives the Dirichlet distribution that determines the mixture probabilities degenerates to beta = rBeta(1, alpha_0). Clusters that end up with points for the next iteration will overwhelm the small constants (alpha_0, 1), and subsequent new mixture probabilities will derive from beta ~= rBeta(count, total), which is the current implementation. All empty clusters will subsequently be driven by beta ~= rBeta(1, total), as alpha_0 is insignificant and count is 0. The current implementation ends up using beta = rBeta(alpha_0/k, alpha_0) as initial values during all iterations because the counts are all initialized to alpha_0/k. Close but no cigar. Jeff (nothing new below)
Looking at the current implementation, it is only used to initialize the totalCount values (to alpha/k) when sampling from the prior. AFAICT it is not used anywhere else. Its current role is pretty minimal and I wonder if something fell through the cracks during all of the refactoring from the R prototype. Well, I looked over the R code and alpha_0 does appear to be used in two places, not one:
- in state initialization: beta = rbeta(K, 1, alpha_0) [K is the number of models]
- during state update: beta[k] = rbeta(1, 1 + counts[k], alpha_0 + N - counts[k]) [N is the cardinality of the sample vector and counts corresponds to totalCounts in the implementation]
The value of beta[k] is then used in the Dirichlet distribution calculation which results in the mixture probabilities pi[i] for the iteration:
other = 1  # product accumulator
for (k in 1:K) {
  pi[k] = beta[k] * other  # beta_k * prod_{n<k} (1 - beta_n)
  other = other * (1 - beta[k])
}
Alpha_0 does not appear to ever be added to the total counts, nor is it divided by K as in the implementation, so it looks like something did get lost in the refactoring. In the implementation, UncommonDistributions.rDirichlet(Vector alpha) is passed the totalCounts to compute the mixture probabilities, and the rBeta arguments do not use alpha_0 as in R. There are other differences, however, and rDirichlet looks like:
public static Vector rDirichlet(Vector alpha) {
  Vector r = alpha.like();
  double total = alpha.zSum();
  double remainder = 1;
  for (int i = 0; i < r.size(); i++) {
    double a = alpha.get(i);
    total -= a;
    double beta = rBeta(a, Math.max(0, total));
    double p = beta * remainder;
    r.set(i, p);
    remainder -= p;
  }
  return r;
}
Hi Ted, I made the following changes, which still seem to work. I added alpha_0 as an argument to rDirichlet and included it in the beta calculation. I also removed the alpha_0/k totalCount initialization. This now corresponds, I think, to the R code above and degenerates to the same initial beta arguments during initialization when totalCounts are 0. Could you please look this over and see if you agree?
Thanks, Jeff
/**
 * Sample from a Dirichlet distribution, returning a vector of probabilities using a
 * stick-breaking algorithm
 *
 * @param totalCounts an unnormalized count Vector
 * @param alpha_0 a double
 * @return a Vector of probabilities
 */
public static Vector rDirichlet(Vector totalCounts, double alpha_0) {
  Vector result = totalCounts.like();
  double total = totalCounts.zSum();
  double other = 1.0;
  for (int i = 0; i < result.size(); i++) {
    double count = totalCounts.get(i);
    total -= count;
    double beta = rBeta(1 + count, Math.max(0, alpha_0 + total));
    double pi = beta * other;
    result.set(i, pi);
    other *= 1 - beta;
  }
  return result;
}
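For readers following this thread, the R update rule above is the standard stick-breaking construction. Written out in LaTeX, with c_k the count for model k and N the sample size:

\beta_k \sim \mathrm{Beta}\left(1 + c_k,\; \alpha_0 + N - c_k\right),
\qquad
\pi_k = \beta_k \prod_{n < k} \left(1 - \beta_n\right),
\qquad k = 1, \dots, K

At initialization all c_k are 0, so \beta_k \sim \mathrm{Beta}(1, \alpha_0), matching the state-initialization call rbeta(K, 1, alpha_0) quoted above.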