Re: Mahout 0.9 Release Notes - First Draft
Hi, one thing I forgot: you once mentioned running into issues with the new k-means - are those fixed, or are they tracked in JIRA? In the latter case we should include a known issues / call for helping hands section. Isabel
[jira] [Created] (MAHOUT-1387) Create page for release notes
Isabel Drost-Fromm created MAHOUT-1387:
--
Summary: Create page for release notes
Key: MAHOUT-1387
URL: https://issues.apache.org/jira/browse/MAHOUT-1387
Project: Mahout
Issue Type: Improvement
Components: Documentation
Affects Versions: 0.8
Reporter: Isabel Drost-Fromm
Priority: Minor

Starting with 0.6, our release notes are published on our main web page, interleaved with other news items. For reference it would be good to have one canonical go-to page for past release notes on our main Apache CMS powered web page.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1305) Rework the wiki
[ https://issues.apache.org/jira/browse/MAHOUT-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855804#comment-13855804 ] Isabel Drost-Fromm commented on MAHOUT-1305:

* Pages now available on the CMS have been moved under a DeletionCandidates parent. Please double check - if I hear nothing by Dec 28th I'll delete them.
* Pages with bogus content or nearly no content have been deleted.
* Moved all pages that I remembered being referred to under RedirectPages, editing each to contain a link that points to the current CMS-based version. There wasn't a whole lot left: three pages looked like they could be valuable and were moved to CMS Migration Candidates. I'd keep the remaining ones simply in the wiki.

Concerning the CSS: I couldn't change the CSS (not even with my recovered isabel account). However, for the few pages I thought could be valuable to keep, it helped to explicitly set them to be left-aligned in the edit box.

Concerning the link to the wiki: probably because someone asked me for a link after publishing the new main web page, I had forgotten that this is already available on our main web site (check the entry in the General tab). Let me know whether this link should be someplace more prominent (keep in mind, though, that given that we have Apache CMS now there probably won't be a whole lot of content left to put into the wiki anyway).

Rework the wiki
---
Key: MAHOUT-1305
URL: https://issues.apache.org/jira/browse/MAHOUT-1305
Project: Mahout
Issue Type: Bug
Components: Documentation
Reporter: Sebastian Schelter
Priority: Blocker
Fix For: 0.9
Attachments: MAHOUT-221213-1315-15716.pdf

We should think about completely redoing our wiki. At the moment, we're listing lots of algorithms that we either never implemented or already removed. I also have the impression that a lot of stuff is outdated. It would be awesome if we had up-to-date documentation of the code with instructions on how to get started using Mahout quickly.
We should also have examples for all our 3 C's. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Streaming KMeans clustering
Has anyone been successful running Streaming KMeans clustering on a large dataset (> 100,000 points)? It just seems to take a very long time (> 4hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running and throws an OOM, failing the job several hours after it has been kicked off. It's the same story when trying to run in sequential mode. Looking at the code, the bottleneck seems to be in StreamingKMeans.clusterInternal(); without understanding the behaviour of the algorithm I am not sure if the sequence of steps in there is correct. There are a few calls that invoke themselves repeatedly, like StreamingKMeans.clusterInternal() and Searcher.searchFirst(). We really need to have this working on datasets that are larger than the 20K Reuters dataset. I am trying to run this on 300K vectors with k = 100, km = 1261 and FastProjectionSearch.
Re: Streaming KMeans clustering
That the algorithm runs a single reducer is expected. The algorithm creates a sketch of the data in parallel in the map phase, which is collected by the reducer afterwards. The reducer then applies an expensive in-memory clustering algorithm to the sketch. Which dataset are you using for testing? I can also do some tests on a cluster here. I can imagine two possible causes for the problems: maybe there's a problem with the vectors and some calculations take very long because the wrong access pattern or implementation is chosen. Another problem could be that the mappers and reducers have too little memory and spend a lot of time running garbage collections.

--sebastian

On 23.12.2013 22:14, Suneel Marthi wrote:
> Has anyone been successful running Streaming KMeans clustering on a large dataset (> 100,000 points)? It just seems to take a very long time (> 4hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running and throws an OOM, failing the job several hours after it has been kicked off.
> [...]
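Sebastian's description of the sketch phase can be illustrated with a toy single-pass clusterer: each incoming point either merges into the nearest sketch centroid or, if it lies beyond the distance cutoff, opens a new one. This is a simplified sketch of the idea only, not Mahout's StreamingKMeans, which additionally grows the cutoff and collapses the sketch when it gets too large.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the streaming sketch phase: one pass over the data,
// keeping a small set of weighted centroids instead of all points.
// This is NOT Mahout's StreamingKMeans; the real class also inflates the
// distance cutoff and re-collapses the sketch when it grows too large.
public class StreamingSketch {
  static class Centroid {
    double[] point;
    double weight;
    Centroid(double[] p) { point = p.clone(); weight = 1; }
  }

  final List<Centroid> sketch = new ArrayList<>();
  final double distanceCutoff;

  StreamingSketch(double distanceCutoff) { this.distanceCutoff = distanceCutoff; }

  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return Math.sqrt(s);
  }

  void cluster(double[] p) {
    // find the nearest existing centroid
    Centroid nearest = null;
    double best = Double.POSITIVE_INFINITY;
    for (Centroid c : sketch) {
      double d = dist(c.point, p);
      if (d < best) { best = d; nearest = c; }
    }
    if (nearest == null || best > distanceCutoff) {
      sketch.add(new Centroid(p));  // far away: open a new centroid
    } else {
      // close enough: fold the point into the weighted mean
      for (int i = 0; i < p.length; i++) {
        nearest.point[i] = (nearest.point[i] * nearest.weight + p[i]) / (nearest.weight + 1);
      }
      nearest.weight++;
    }
  }

  public static void main(String[] args) {
    StreamingSketch s = new StreamingSketch(1.0);
    double[][] data = {{0, 0}, {0.1, 0}, {10, 10}, {10, 10.1}, {0, 0.2}};
    for (double[] p : data) s.cluster(p);
    System.out.println(s.sketch.size()); // two well-separated groups -> 2 centroids
  }
}
```

The single reducer then runs an exact in-memory k-means (BallKMeans in Mahout) over these few weighted centroids rather than over all 300K points, which is why its memory budget, not the mappers', usually determines whether the job survives.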
[jira] [Commented] (MAHOUT-1358) StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true
[ https://issues.apache.org/jira/browse/MAHOUT-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855974#comment-13855974 ] Hudson commented on MAHOUT-1358:

SUCCESS: Integrated in Mahout-Quality #2381 (See [https://builds.apache.org/job/Mahout-Quality/2381/])
MAHOUT-1358 - earlier fix for this issue throws a heap space exception for large datasets during the Mapper phase; new fix in place now, plus code cleanup. (smarthi: rev 1553189)
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansThread.java

StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true
-
Key: MAHOUT-1358
URL: https://issues.apache.org/jira/browse/MAHOUT-1358
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
Fix For: 0.9
Attachments: MAHOUT-1358.patch

Running StreamingKMeans clustering with REDUCE_STREAMING_KMEANS = true throws the following error:

{code}
java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.00149011612, 0]
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:120)
	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
{code}

The issue is caused by the following code in StreamingKMeansThread.call():

{code}
Iterator<Centroid> datapointsIterator = datapoints.iterator();
if (estimateDistanceCutoff == StreamingKMeansDriver.INVALID_DISTANCE_CUTOFF) {
  List<Centroid> estimatePoints = Lists.newArrayListWithExpectedSize(NUM_ESTIMATE_POINTS);
  while (datapointsIterator.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
    estimatePoints.add(datapointsIterator.next());
  }
  estimateDistanceCutoff = ClusteringUtils.estimateDistanceCutoff(estimatePoints, searcher.getDistanceMeasure());
}

StreamingKMeans clusterer = new StreamingKMeans(searcher, numClusters, estimateDistanceCutoff);
while (datapointsIterator.hasNext()) {
  clusterer.cluster(datapointsIterator.next());
}
{code}

The code is using the same iterator twice, and it fails on the second use for obvious reasons.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
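The underlying pattern of the fix is general: when a prefix of a single-use iterator is consumed to estimate a parameter, that buffered prefix must still be fed to the main processing loop, or those elements are silently lost. A self-contained illustration with plain Java collections (names and the toy estimate() are illustrative, not Mahout's API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrates the single-pass fix for the bug above: points consumed from the
// iterator while estimating a parameter are buffered and processed afterwards,
// so no element of the stream is dropped. Names are illustrative, not Mahout's.
public class IteratorReuseFix {
  static final int NUM_ESTIMATE_POINTS = 3;

  // Stand-in for estimateDistanceCutoff(): here, just the max of the sample.
  static int estimate(List<Integer> sample) {
    int max = Integer.MIN_VALUE;
    for (int v : sample) max = Math.max(max, v);
    return max;
  }

  static List<Integer> process(Iterable<Integer> datapoints) {
    Iterator<Integer> it = datapoints.iterator();

    // 1. Consume a prefix of the stream to estimate a parameter...
    List<Integer> estimatePoints = new ArrayList<>();
    while (it.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
      estimatePoints.add(it.next());
    }
    int cutoff = estimate(estimatePoints); // would configure the clusterer here

    // 2. ...then process the buffered prefix FIRST (the step that was missing),
    List<Integer> clustered = new ArrayList<>(estimatePoints);

    // 3. ...and continue with the SAME iterator for the remainder.
    while (it.hasNext()) {
      clustered.add(it.next());
    }
    return clustered; // every point seen exactly once
  }

  public static void main(String[] args) {
    List<Integer> data = List.of(5, 1, 4, 9, 2);
    System.out.println(process(data)); // all five points survive, in order
  }
}
```

Calling datapoints.iterator() a second time would also work when the source is a materialized collection, but buffering is the only option when the data is a true one-shot stream, which is the situation in the reducer.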
Jenkins build is back to normal : Mahout-Quality #2381
See https://builds.apache.org/job/Mahout-Quality/2381/changes
Re: Mahout 0.9 Release Notes - First Draft
On Sat, Dec 21, 2013 at 6:28 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Hi All,

Please see below the first draft of the release notes for Mahout 0.9. Please feel free to add/edit sections as you see fit. (This is a draft only.)

Regards,
Suneel

-

The Apache Mahout PMC is pleased to announce the release of Mahout 0.9. Mahout's goal is to build scalable machine learning libraries focused primarily on the areas of collaborative filtering (recommenders), clustering and classification (known collectively as the 3Cs), as well as the necessary infrastructure to support those implementations, including, but not limited to, math packages for statistics, linear algebra and others, Java primitive collections, local and distributed vector and matrix classes, and a variety of integrative code to work with popular packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache Cassandra and much more.

The 0.9 release is mainly a clean-up release in preparation for an upcoming 1.0 release targeted for the first half of 2014, but there are a few significant new features, which are highlighted below.

To get started with Apache Mahout 0.9, download the release artifacts and signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central Maven repository. In addition to the release highlights and artifacts, please pay attention to the section labelled FUTURE PLANS below for more information about upcoming releases of Mahout.

As with any release, we wish to thank all of the users and contributors to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for individual credits, as there are too many to list here.

GETTING STARTED

In the release package, the examples directory contains several working examples of the core functionality available in Mahout. These can be run via scripts in the examples/bin directory and will prompt you for more information to help you try things out. Most examples do not need a Hadoop cluster in order to run.
RELEASE HIGHLIGHTS

The highlights of the Apache Mahout 0.9 release include, but are not limited to, the list below. For further information, see the included CHANGELOG file.

- Scala DSL Bindings for Mahout Math Linear Algebra (MAHOUT-1297). See http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
- New Multilayer Perceptron Classifier (MAHOUT-1265)
- Recommenders as a Search (MAHOUT-1288). See https://github.com/pferrel/solr-recommender
- MAHOUT-1364: Upgrade Mahout to be Lucene 4.6.0 compliant
- MAHOUT-1361: Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering. See https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf for the details.
- Removed deprecated algorithms.
- The usual bug fixes.

See JIRA [?] for more information on the 0.9 release. A total of 91 separate JIRA issues were addressed in this release.

The following algorithms that were marked deprecated in 0.8 have been removed in 0.9:

- From Clustering:
  - Dirichlet - replaced by Collapsible Variational Bayes (CVB)
    (I think the name of the method I commonly hear is Collapsed Variational Bayes)
  - MeanShift
  - MinHash - removed due to poor performance and lack of usage
  - EigenCuts
- From Classification (both are sequential implementations):
  - Winnow - lack of actual usage
  - Perceptron - lack of actual usage
- Frequent Pattern Mining
- Collaborative Filtering:
  - All recommenders in org.apache.mahout.cf.taste.impl.recommender.knn
  - SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
  - Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
  - TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender
- Mahout Math:
  - Lanczos in favour of SSVD
  - Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

If you are interested in supporting one or more of these algorithms, please make it known on dev@mahout.apache.org and via JIRA issues that
fix and/or improve them. Please also provide supporting evidence as to their effectiveness for you in production.

CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are interested in contributing, please see our contribution page, https://cwiki.apache.org/MAHOUT/how-to-contribute.html, on the Mahout wiki or contact us via email at dev@mahout.apache.org.

FUTURE PLANS

1.0 Plans:
- New Downpour SGD classifier
- Support for Finite State Transducers (FST) as a Dictionary Type
- Support for Hadoop 2.x
- Port Mahout's recommenders to Spark (??)
- Support for Java 7
- Better API interfaces for Clustering
- (what else???)

As the project moves towards a 1.0 release, the community will be focused on key algorithms that are proven to scale in
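The MAHOUT-1361 highlight above (quantiles via 1-dimensional clustering) boils down to summarizing a value distribution as a small set of weighted 1-D centroids and reading quantiles off the cumulative weights. The following is a much-simplified sketch of that idea using equal-count bins; the real t-digest sizes bins adaptively (smaller near the tails), which is what makes its extreme quantiles accurate.

```java
import java.util.Arrays;

// Simplified illustration of quantile estimation via 1-D clustering:
// sort, group into equal-count bins, keep (mean, count) per bin, then
// answer quantile queries from the cumulative counts. The real t-digest
// uses variable-size bins near the tails for far better accuracy there.
public class BinnedQuantiles {
  final double[] means;  // centroid of each bin
  final long[] counts;   // weight of each bin
  final long total;

  BinnedQuantiles(double[] data, int bins) {
    double[] sorted = data.clone();
    Arrays.sort(sorted);
    means = new double[bins];
    counts = new long[bins];
    long n = sorted.length;
    for (int b = 0; b < bins; b++) {
      int lo = (int) (b * n / bins), hi = (int) ((b + 1) * n / bins);
      double sum = 0;
      for (int i = lo; i < hi; i++) sum += sorted[i];
      means[b] = sum / (hi - lo);
      counts[b] = hi - lo;
    }
    total = n;
  }

  // Walk the cumulative weights until the q-th fraction is covered.
  double quantile(double q) {
    long target = (long) (q * total);
    long seen = 0;
    for (int b = 0; b < means.length; b++) {
      seen += counts[b];
      if (seen >= target) return means[b];
    }
    return means[means.length - 1];
  }

  public static void main(String[] args) {
    double[] data = new double[10000];
    for (int i = 0; i < data.length; i++) data[i] = i;  // uniform 0..9999
    BinnedQuantiles bq = new BinnedQuantiles(data, 100);
    System.out.println(bq.quantile(0.5));  // 4949.5, close to the true median
  }
}
```

The payoff is the same as in the real algorithm: memory is proportional to the number of centroids, not the number of points, so the summary can be built online and merged across mappers.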
Re: Mahout 0.9 Release Notes - First Draft
On Sun, Dec 22, 2013 at 11:21 AM, Sebastian Schelter ssc.o...@googlemail.com wrote:
> - Mahout Math Lanczos in favour of SSVD
> IIRC, we agreed to not remove Lanczos, although it was initially deprecated. We should undeprecate it.

Some folks like Lanczos in Mahout (for reasons not really clear to me; aside from accuracy when computing the SVD of random noise, there are actually zero reasons to use Lanczos instead). I agree we don't necessarily want to cull it out - but IMO there should be a clear steer posted in favor of SSVD in the docs/javadocs.
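The steer toward SSVD rests on the randomized-projection idea: instead of Lanczos iterations against A, multiply A by a thin random Gaussian matrix and orthonormalize the result to capture A's dominant column space; only a small projected matrix then needs an exact decomposition. A toy sketch of just that first range-finding step (plain arrays and modified Gram-Schmidt; Mahout's SSVD is a far more careful distributed implementation):

```java
import java.util.Random;

// Toy version of the first step of stochastic SVD (SSVD): find an orthonormal
// basis Q whose span approximates the column space of A, by sketching
// Y = A * Omega with a random Gaussian Omega and orthonormalizing Y.
// Mahout's SSVD is a distributed, out-of-core implementation of this idea.
public class RangeFinder {
  // Returns Q (m x k) with orthonormal columns approximating range(A).
  static double[][] range(double[][] a, int k, long seed) {
    int m = a.length, n = a[0].length;
    Random rnd = new Random(seed);

    // Y = A * Omega, where Omega is an n x k Gaussian matrix
    double[][] y = new double[m][k];
    for (int j = 0; j < k; j++) {
      double[] omegaCol = new double[n];
      for (int i = 0; i < n; i++) omegaCol[i] = rnd.nextGaussian();
      for (int r = 0; r < m; r++) {
        double s = 0;
        for (int i = 0; i < n; i++) s += a[r][i] * omegaCol[i];
        y[r][j] = s;
      }
    }

    // Orthonormalize the columns of Y (modified Gram-Schmidt)
    for (int j = 0; j < k; j++) {
      for (int p = 0; p < j; p++) {
        double dot = 0;
        for (int r = 0; r < m; r++) dot += y[r][p] * y[r][j];
        for (int r = 0; r < m; r++) y[r][j] -= dot * y[r][p];
      }
      double norm = 0;
      for (int r = 0; r < m; r++) norm += y[r][j] * y[r][j];
      norm = Math.sqrt(norm);
      for (int r = 0; r < m; r++) y[r][j] /= norm;
    }
    return y;
  }

  public static void main(String[] args) {
    double[][] a = {{2, 0, 0}, {0, 1, 0}, {0, 0, 0.01}, {1, 1, 0}};
    double[][] q = range(a, 2, 42);
    double dot = 0;
    for (int r = 0; r < q.length; r++) dot += q[r][0] * q[r][1];
    System.out.printf("%.6f%n", Math.abs(dot)); // 0.000000: columns orthogonal
  }
}
```

After this step, B = Q^T A is a small k x n matrix whose exact SVD is cheap, which is why SSVD needs only a constant number of passes over A where Lanczos needs one pass per iteration.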
Re: Mahout 0.9 Release Notes - First Draft
Suneel ran into some issues this weekend; I'm going to try it out and see if I can reproduce them.

On Dec 23, 2013, at 1:02 AM, Isabel Drost-Fromm isa...@apache.org wrote:
> Hi, one thing I forgot: you once mentioned running into issues with the new kmeans - are those fixed or tracked in jira? In case of the latter we should include a known issues/ call for helping hands section. Isabel
[jira] [Created] (MAHOUT-1388) Add command line support and logging for MLP
Yexi Jiang created MAHOUT-1388:
--
Summary: Add command line support and logging for MLP
Key: MAHOUT-1388
URL: https://issues.apache.org/jira/browse/MAHOUT-1388
Project: Mahout
Issue Type: Improvement
Components: Classification
Affects Versions: 1.0
Reporter: Yexi Jiang
Fix For: 1.0

The user should have the ability to run the Perceptron from the command line. There are two modes for the MLP, training and labeling: the first takes the data as input and outputs the model; the second takes the model and unlabeled data as input and outputs the results.

The parameters are as follows:

--mode -mo // train or label
--input -i // input data
--model -mo // in training mode, this is the location to store the model (if the specified location has an existing model, it will update the model through incremental learning); in labeling mode, this is the location to store the result
--output -o // this is only useful in labeling mode
--layersize -ls // no. of units per hidden layer; use comma-separated numbers to indicate the number of neurons in each layer (including input layer and output layer)
--momentum -m
--learningrate -l
--regularizationweight -r
--costfunction -cf // the type of cost function

For example, to train a 3-layer (including input, hidden, and output) MLP with the Minus_Square cost function, 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the parameters would be:

mlp -mo train -i /tmp/training-data.csv -o /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01 -cf minus_squared

This command would read the training data from /tmp/training-data.csv and write the trained model to /tmp/model.model. If a user needs to use an existing model, the following command would be used:

mlp -mo label -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result

Moreover, we should provide default values if the user does not specify any.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
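A minimal sketch of how the flags proposed above could be parsed in plain Java (no CLI library). The long/short names follow the proposal; the default values are my own illustrative assumptions, not decided behavior; and the proposal's --model flag is omitted here because its -mo short form collides with --mode as written.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a parser for the MLP flags proposed above. The long/short names
// come from the proposal; the defaults below are illustrative assumptions.
public class MlpArgs {
  final Map<String, String> values = new HashMap<>();

  // map both long and short forms onto one canonical key
  private static final Map<String, String> ALIASES = new HashMap<>();
  static {
    ALIASES.put("--mode", "mode");            ALIASES.put("-mo", "mode");
    ALIASES.put("--input", "input");          ALIASES.put("-i", "input");
    ALIASES.put("--output", "output");        ALIASES.put("-o", "output");
    ALIASES.put("--layersize", "layersize");  ALIASES.put("-ls", "layersize");
    ALIASES.put("--momentum", "momentum");    ALIASES.put("-m", "momentum");
    ALIASES.put("--learningrate", "learningrate");
    ALIASES.put("-l", "learningrate");
    ALIASES.put("--regularizationweight", "regularizationweight");
    ALIASES.put("-r", "regularizationweight");
    ALIASES.put("--costfunction", "costfunction");
    ALIASES.put("-cf", "costfunction");
  }

  MlpArgs(String[] args) {
    // illustrative defaults, as the proposal suggests providing some
    values.put("momentum", "0.1");
    values.put("learningrate", "0.05");
    values.put("regularizationweight", "0.01");
    // flags come in (name, value) pairs
    for (int i = 0; i + 1 < args.length; i += 2) {
      String key = ALIASES.get(args[i]);
      if (key == null) throw new IllegalArgumentException("Unknown flag: " + args[i]);
      values.put(key, args[i + 1]);
    }
  }

  String get(String key) { return values.get(key); }

  public static void main(String[] args) {
    MlpArgs parsed = new MlpArgs(new String[] {
        "-mo", "train", "-i", "/tmp/training-data.csv",
        "-ls", "5,3,1", "-l", "0.1", "-cf", "minus_squared"});
    System.out.println(parsed.get("mode") + " " + parsed.get("layersize"));
  }
}
```

In practice Mahout drivers build on AbstractJob and its option-parsing machinery rather than hand-rolled parsing like this; the sketch only shows the alias and default-value behavior the ticket asks for.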