Re: Mahout 0.9 Release Notes - First Draft

2013-12-23 Thread Isabel Drost-Fromm
Hi,

one thing I forgot: you once mentioned running into issues with the new kmeans 
- are those fixed or tracked in JIRA? In case of the latter, we should include a 
known issues / call for helping hands section.

Isabel


[jira] [Created] (MAHOUT-1387) Create page for release notes

2013-12-23 Thread Isabel Drost-Fromm (JIRA)
Isabel Drost-Fromm created MAHOUT-1387:
--

 Summary: Create page for release notes
 Key: MAHOUT-1387
 URL: https://issues.apache.org/jira/browse/MAHOUT-1387
 Project: Mahout
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 0.8
Reporter: Isabel Drost-Fromm
Priority: Minor


Starting with 0.6, our release notes have been published on our main web page - 
interleaved with other news items.

For reference it would be good to have one canonical go-to page for past 
release notes on our main Apache CMS powered web page.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1305) Rework the wiki

2013-12-23 Thread Isabel Drost-Fromm (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855804#comment-13855804
 ] 

Isabel Drost-Fromm commented on MAHOUT-1305:



* Pages now available on the CMS have been moved under the DeletionCandidates 
parent. Please double check - if I hear nothing until Dec 28th I'll delete them.
* Pages with bogus content or nearly no content have been deleted.
* Moved all pages that I remembered being referred to under RedirectPages, 
editing each to contain a link that points to the current CMS based version.

There wasn't a whole lot left: three pages looked like they could be valuable 
and were moved to CMS Migration Candidates. I'd simply keep the remaining ones 
in the wiki.

Concerning the CSS: I couldn't change the CSS (not even with my recovered 
isabel account). However, for the few pages I thought could be valuable to 
keep, it helped to explicitly set them to be left-aligned in the edit box.

Concerning the link to the wiki: when someone asked me for a link after we 
published the new main web page, I had forgotten that this is already available 
on our main web site (check the entry in the General tab). Let me know 
whether this link should be someplace more prominent (keep in mind, though, that 
given we have Apache CMS now there probably won't be a whole lot of 
content left to put into the wiki anyway).

 Rework the wiki
 ---

 Key: MAHOUT-1305
 URL: https://issues.apache.org/jira/browse/MAHOUT-1305
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Reporter: Sebastian Schelter
Priority: Blocker
 Fix For: 0.9

 Attachments: MAHOUT-221213-1315-15716.pdf


 We should think about completely redoing our wiki. At the moment, we're 
 listing lots of algorithms that we either never implemented or have already 
 removed. I also have the impression that a lot of the content is outdated.
 It would be awesome if we had up-to-date documentation of the code with 
 instructions on how to get started using Mahout quickly.
 We should also have examples for all our 3 C's.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Streaming KMeans clustering

2013-12-23 Thread Suneel Marthi
Has anyone been successful running Streaming KMeans clustering on a large 
dataset (> 100,000 points)?


It just seems to take a very long time (> 4 hrs) for the mappers to finish on 
about 300K data points, and the reduce phase has only a single reducer running, 
which throws an OOM that fails the job several hours after it has been kicked 
off.

It's the same story when trying to run in sequential mode.

Looking at the code, the bottleneck seems to be in 
StreamingKMeans.clusterInternal(); without understanding the behaviour of the 
algorithm, I am not sure if the sequence of steps in there is correct.


There are a few calls that are invoked repeatedly, like 
StreamingKMeans.clusterInternal() and Searcher.searchFirst().

We really need to have this working on datasets that are larger than the 20K 
Reuters dataset.

I am trying to run this on 300K vectors with k = 100, km = 1261 and 
FastProjectSearch.


Re: Streaming KMeans clustering

2013-12-23 Thread Sebastian Schelter
That the algorithm runs a single reducer is expected. The algorithm
creates a sketch of the data in parallel in the map-phase, which is
collected by the reducer afterwards. The reducer then applies an
expensive in-memory clustering algorithm to the sketch.
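The map-side sketch / reduce-side clustering split described above can be illustrated with a toy example in Java. All names here (and the greedy 1-D sketching rule) are illustrative only, not Mahout's actual StreamingKMeans API:

```java
import java.util.ArrayList;
import java.util.List;

public class SketchThenCluster {
  // A weighted 1-D point standing in for a Mahout Centroid.
  static final class WeightedPoint {
    final double value;
    double weight;
    WeightedPoint(double value, double weight) { this.value = value; this.weight = weight; }
  }

  // "Map" phase: greedily fold each point into the nearest sketch centroid
  // if it lies within the cutoff, otherwise start a new centroid.
  static List<WeightedPoint> sketch(double[] partition, double cutoff) {
    List<WeightedPoint> centroids = new ArrayList<>();
    for (double p : partition) {
      WeightedPoint nearest = null;
      double best = Double.MAX_VALUE;
      for (WeightedPoint c : centroids) {
        double d = Math.abs(c.value - p);
        if (d < best) { best = d; nearest = c; }
      }
      if (nearest != null && best <= cutoff) {
        nearest.weight += 1.0; // absorb the point into an existing centroid
      } else {
        centroids.add(new WeightedPoint(p, 1.0)); // open a new centroid
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    // Two "mapper" partitions drawn from clusters around 0 and 10.
    List<WeightedPoint> sketchA = sketch(new double[] {0.1, 0.2, 9.9}, 1.0);
    List<WeightedPoint> sketchB = sketch(new double[] {10.1, 0.0, 10.2}, 1.0);
    // "Reduce" phase input: the union of the sketches is far smaller than
    // the raw input, so an expensive in-memory algorithm (BallKMeans in
    // Mahout) can afford to run on it in a single reducer.
    List<WeightedPoint> union = new ArrayList<>(sketchA);
    union.addAll(sketchB);
    System.out.println(union.size()); // 4 sketch centroids versus 6 raw points
  }
}
```

The point of the sketch is exactly this size reduction: the single reducer never sees the raw 300K vectors, only the weighted centroids emitted by the mappers.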

Which dataset are you using for testing? I can also do some tests on a
cluster here.

I can imagine two possible causes for the problems: maybe there's a
problem with the vectors, and some calculations take very long because
the wrong access pattern or implementation is chosen.

Another cause could be that the mappers and reducers have too little
memory and spend a lot of time running garbage collections.

--sebastian


On 23.12.2013 22:14, Suneel Marthi wrote:
 Has anyone been successful running Streaming KMeans clustering on a large 
 dataset (> 100,000 points)?
 
 
 It just seems to take a very long time (> 4 hrs) for the mappers to finish on 
 about 300K data points, and the reduce phase has only a single reducer running, 
 which throws an OOM that fails the job several hours after it has been kicked 
 off.
 
 It's the same story when trying to run in sequential mode.
 
 Looking at the code, the bottleneck seems to be in 
 StreamingKMeans.clusterInternal(); without understanding the behaviour of the 
 algorithm, I am not sure if the sequence of steps in there is correct.
 
 
 There are a few calls that are invoked repeatedly, like 
 StreamingKMeans.clusterInternal() and Searcher.searchFirst().
 
 We really need to have this working on datasets that are larger than the 20K 
 Reuters dataset.
 
 I am trying to run this on 300K vectors with k = 100, km = 1261 and 
 FastProjectSearch.
 



[jira] [Commented] (MAHOUT-1358) StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true

2013-12-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855974#comment-13855974
 ] 

Hudson commented on MAHOUT-1358:


SUCCESS: Integrated in Mahout-Quality #2381 (See 
[https://builds.apache.org/job/Mahout-Quality/2381/])
MAHOUT-1358 - earlier fix for this issue throws a heap space exception for 
large datasets during the Mapper phase, new fix in place now and code cleanup. 
(smarthi: rev 1553189)
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansThread.java


 StreamingKMeansThread throws IllegalArgumentException when 
 REDUCE_STREAMING_KMEANS is set to true
 -

 Key: MAHOUT-1358
 URL: https://issues.apache.org/jira/browse/MAHOUT-1358
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 0.9

 Attachments: MAHOUT-1358.patch


 Running StreamingKMeans Clustering with REDUCE_STREAMING_KMEANS = true, 
 throws the following error
 {Code}
 java.lang.IllegalArgumentException: Must have nonzero number of training and 
 test vectors. Asked for %.1f %% of %d vectors for test [10.00149011612, 0]
   at 
 com.google.common.base.Preconditions.checkArgument(Preconditions.java:120)
   at 
 org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
   at 
 org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
   at 
 org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
   at 
 org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
   at 
 org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
   at 
 org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
 {Code}
 The issue is caused by the following code in StreamingKMeansThread.call()
 {Code}
 Iterator<Centroid> datapointsIterator = datapoints.iterator();
 if (estimateDistanceCutoff == StreamingKMeansDriver.INVALID_DISTANCE_CUTOFF) {
   List<Centroid> estimatePoints = Lists.newArrayListWithExpectedSize(NUM_ESTIMATE_POINTS);
   while (datapointsIterator.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
     estimatePoints.add(datapointsIterator.next());
   }
   estimateDistanceCutoff = ClusteringUtils.estimateDistanceCutoff(estimatePoints,
       searcher.getDistanceMeasure());
 }
 StreamingKMeans clusterer = new StreamingKMeans(searcher, numClusters,
     estimateDistanceCutoff);
 while (datapointsIterator.hasNext()) {
   clusterer.cluster(datapointsIterator.next());
 }
 {Code}
 The code is using the same iterator twice, and it fails on the second use 
 because the first loop may already have exhausted it.
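One way to avoid the double consumption (a sketch of the general fix pattern, not necessarily the patch that was committed) is to buffer the sampled estimate points and replay them before draining the rest of the single-pass iterator:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class IteratorReuse {
  static final int NUM_ESTIMATE_POINTS = 3;

  // Consumes up to NUM_ESTIMATE_POINTS elements for estimation, then
  // replays the buffered elements so every point still reaches clustering.
  static List<Integer> clusterAll(Iterable<Integer> datapoints) {
    Iterator<Integer> it = datapoints.iterator();
    List<Integer> estimatePoints = new ArrayList<>();
    while (it.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
      estimatePoints.add(it.next());
    }
    // ... a distance cutoff would be estimated from estimatePoints here ...
    List<Integer> clustered = new ArrayList<>(estimatePoints); // replay the buffer
    while (it.hasNext()) {
      clustered.add(it.next()); // then drain the remainder of the iterator
    }
    return clustered;
  }

  public static void main(String[] args) {
    System.out.println(clusterAll(List.of(1, 2, 3, 4, 5)));
  }
}
```

With this shape, the second loop sees the estimate points again instead of an already-exhausted iterator, so no datapoint is silently dropped.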



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Jenkins build is back to normal : Mahout-Quality #2381

2013-12-23 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Quality/2381/changes



Re: Mahout 0.9 Release Notes - First Draft

2013-12-23 Thread Dmitriy Lyubimov
On Sat, Dec 21, 2013 at 6:28 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Hi All,

 Please see below the first draft of the release notes for Mahout 0.9. Please
 feel free to add/edit sections as you see fit.
 (This is a draft only.)

 Regards,
 Suneel


 -


 The Apache Mahout PMC is pleased to announce the release of Mahout 0.9.
 Mahout's goal is to build scalable machine learning libraries focused
 primarily on the areas of collaborative filtering (recommenders),
 clustering and classification (known collectively as the 3Cs), as well
 as the necessary infrastructure to support those implementations. This
 includes, but is not limited to, math packages for statistics, linear
 algebra and more, Java primitive collections, local and distributed
 vector and matrix classes, and a variety of integrative code to work
 with popular packages like Apache Hadoop, Apache Lucene, Apache HBase
 and Apache Cassandra. The 0.9 release is mainly a clean-up release in
 preparation for an upcoming 1.0 release targeted for the first half of
 2014, but there are a few significant new features, which are
 highlighted below.

 To get started with Apache Mahout 0.9, download the release artifacts and
 signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the
 central Maven repository.

 In addition to the release highlights and artifacts, please pay attention
 to the section labelled FUTURE PLANS below for more information about
 upcoming releases of Mahout.

 As with any release, we wish to thank all of the users and contributors
 to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for
 individual credits, as there are too many to list here.

 GETTING STARTED

 In the release package, the examples directory contains several working
 examples of the core functionality available in Mahout. These can be run
 via scripts in the examples/bin directory and will prompt you for more
 information to help you try things out. Most examples do not need a
 Hadoop cluster in order to run.

 RELEASE HIGHLIGHTS

 The highlights of the Apache Mahout 0.9 release include, but are not
 limited to, the items listed below. For further information, see the
 included CHANGELOG file.

 - Scala DSL Bindings for Mahout Math Linear Algebra (MAHOUT-1297).
See
 http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
 - New Multilayer Perceptron Classifier (MAHOUT-1265)
 - Recommenders as a Search (MAHOUT-1288).  See
 https://github.com/pferrel/solr-recommender
 - MAHOUT-1364: Upgrade Mahout to be Lucene 4.6.0 compliant
 - MAHOUT-1361: Online Algorithm for computing accurate Quantiles using
   1-dimensional Clustering. See
   https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
   for the details.

 - Removed Deprecated algorithms.

 - The usual bug fixes. See JIRA [?] for more information on the 0.9
 release.


 A total of 91 separate JIRA issues were addressed in this release.

 The following algorithms that were marked deprecated in 0.8 have been
 removed in 0.9:

 - From Clustering:
   Dirichlet - replaced by Collapsible Variational Bayes (CVB)


I think the name of the method i commonly hear is Collapsed Variational
Bayes


   Meanshift

   MinHash - removed due to poor performance and lack of usage

   EigenCuts -


 - From Classification (both are sequential implementations)

   Winnow - lack of actual usage

   Perceptron - lack of actual usage


 - Frequent Pattern Mining

 - Collaborative Filtering
 All recommenders in org.apache.mahout.cf.taste.impl.recommender.knn
 SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone
 and org.apache.mahout.cf.taste.impl.recommender.slopeone
 Distributed pseudo recommender in
 org.apache.mahout.cf.taste.hadoop.pseudo
 TreeClusteringRecommender in
 org.apache.mahout.cf.taste.impl.recommender

 - Mahout Math
 Lanczos in favour of SSVD
 Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

 If you are interested in supporting 1 or more of these algorithms, please
 make it known on dev@mahout.apache.org and via JIRA issues that fix
 and/or improve them. Please also provide
 supporting evidence as to their effectiveness for you in production.


 CONTRIBUTING

 Mahout
  is always looking for contributions focused on the 3Cs. If you are
 interested in contributing, please see our contribution page,
 https://cwiki.apache.org/MAHOUT/how-to-contribute.html, on the Mahout
 wiki or contact us via email at dev@mahout.apache.org.

 FUTURE PLANS

 1.0 Plans
 


 - New Downpour SGD classifier

 - Support for Finite State Transducers (FST) as a Dictionary Type.
 - Support for Hadoop 2.x
 - Port Mahout's recommenders to Spark (??)
 - Support for Java 7
 - Better API interfaces for Clustering
 - (what else???)


 As the project moves towards a 1.0 release, the community will be focused
 on
 key algorithms that are proven to scale in 

Re: Mahout 0.9 Release Notes - First Draft

2013-12-23 Thread Dmitriy Lyubimov
On Sun, Dec 22, 2013 at 11:21 AM, Sebastian Schelter 
ssc.o...@googlemail.com wrote:


 
  - Mahout Math
  Lanczos in favour of SSVD

 IIRC, we agreed to not remove Lanczos, although it was initially
 deprecated. We should undeprecate it.


Some folks like Lanczos in Mahout (for reasons not really clear to me;
aside from accuracy when computing the SVD of random noise, there are
actually zero reasons to use Lanczos instead). I agree we don't necessarily
want to cull it out -- but IMO there should be a clear steer posted in
favor of SSVD in the docs/javadocs.


Re: Mahout 0.9 Release Notes - First Draft

2013-12-23 Thread Andrew Musselman
Suneel ran into some issues this weekend; I'm going to try it out and see if I 
can repro.

 On Dec 23, 2013, at 1:02 AM, Isabel Drost-Fromm isa...@apache.org wrote:
 
 Hi,
 
 one thing I forgot: you once mentioned running into issues with the new 
 kmeans - are those fixed or tracked in JIRA? In case of the latter, we should 
 include a known issues / call for helping hands section.
 
 Isabel


[jira] [Created] (MAHOUT-1388) Add command line support and logging for MLP

2013-12-23 Thread Yexi Jiang (JIRA)
Yexi Jiang created MAHOUT-1388:
--

 Summary: Add command line support and logging for MLP
 Key: MAHOUT-1388
 URL: https://issues.apache.org/jira/browse/MAHOUT-1388
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 1.0
Reporter: Yexi Jiang
 Fix For: 1.0


The user should have the ability to run the Perceptron from the command line.

There are two modes for the MLP: training and labeling. The first takes the 
data as input and outputs the model; the second takes the model and unlabeled 
data as input and outputs the results.

The parameters are as follows:

--mode -mo // train or label
--input -i (input data)
--model -mo  // in training mode, this is the location to store the model (if 
the specified location has an existing model, it will update the model through 
incremental learning), in labeling mode, this is the location to store the 
result
--output -o   // this is only useful in labeling mode
--layersize -ls (no. of units per hidden layer) // use comma separated number 
to indicate the number of neurons in each layer (including input layer and 
output layer)
--momentum -m 
--learningrate -l
--regularizationweight -r
--costfunction -cf // the type of cost function

For example, to train a 3-layer (input, hidden, and output) MLP with the 
minus_squared cost function, a 0.1 learning rate, 0.1 momentum rate, and 0.01 
regularization weight, the command would be:

mlp -mo train -i /tmp/training-data.csv -o /tmp/model.model -ls 5,3,1 -l 0.1 -m 
0.1 -r 0.01 -cf minus_squared

This command would read the training data from /tmp/training-data.csv and write 
the trained model to /tmp/model.model.

If a user needs to use an existing model, they can use the following command:
mlp -mo label -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result

Moreover, we should provide default values if the user does not specify any. 
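Applying such defaults could look like the following minimal sketch. The flag names follow the proposal above, but the default values and the parse helper are hypothetical, not the eventual Mahout CLI:

```java
import java.util.HashMap;
import java.util.Map;

public class MlpArgs {
  // Parses "--flag value" pairs into a map that is pre-filled with
  // defaults, so any flag the user omits falls back to a sensible value.
  static Map<String, String> parse(String[] args) {
    Map<String, String> opts = new HashMap<>();
    opts.put("--learningrate", "0.5");         // hypothetical default
    opts.put("--momentum", "0.1");             // hypothetical default
    opts.put("--regularizationweight", "0.0"); // hypothetical default
    opts.put("--costfunction", "minus_squared");
    for (int i = 0; i + 1 < args.length; i += 2) {
      opts.put(args[i], args[i + 1]);          // user-supplied value wins
    }
    return opts;
  }

  public static void main(String[] args) {
    Map<String, String> opts = parse(new String[] {"--learningrate", "0.1"});
    // The explicit flag overrides its default; the omitted ones keep theirs.
    System.out.println(opts.get("--learningrate") + " " + opts.get("--momentum"));
  }
}
```

Whatever the real parser ends up being (e.g. Mahout's existing option-parsing infrastructure), the principle is the same: seed the option table with defaults first, then overwrite with whatever the user passed.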



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)