Re: Mahout 0.9 Release Notes - First Draft

2013-12-23 Thread Isabel Drost-Fromm
Hi,

one thing I forgot: you once mentioned running into issues with the new kmeans 
- are those fixed or tracked in JIRA? In case of the latter, we should include a 
known issues / call for helping hands section.

Isabel


[jira] [Created] (MAHOUT-1387) Create page for release notes

2013-12-23 Thread Isabel Drost-Fromm (JIRA)
Isabel Drost-Fromm created MAHOUT-1387:
--

 Summary: Create page for release notes
 Key: MAHOUT-1387
 URL: https://issues.apache.org/jira/browse/MAHOUT-1387
 Project: Mahout
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 0.8
Reporter: Isabel Drost-Fromm
Priority: Minor


Starting with 0.6, our release notes have been published on our main web page - 
interleaved with other news items.

For reference it would be good to have one canonical go-to page for past 
release notes on our main Apache CMS powered web page.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1305) Rework the wiki

2013-12-23 Thread Isabel Drost-Fromm (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855804#comment-13855804
 ] 

Isabel Drost-Fromm commented on MAHOUT-1305:



* Pages now available on the CMS have been moved under the DeletionCandidates 
parent. Please double check - if I hear nothing until Dec 28th I'll delete them.
* Pages with bogus content or nearly no content have been deleted.
* Moved all pages that I remembered being referred to under RedirectPages, 
editing each to contain a link that points to the current CMS based version.

There wasn't a whole lot left: three pages looked like they could be valuable 
and were moved to CMS Migration Candidates. I'd simply keep the remaining ones 
in the wiki.

Concerning the CSS: I couldn't change the CSS (not even with my recovered 
isabel account). However, for the few pages I thought could be valuable to 
keep, it helped to explicitly set them to be left-aligned in the edit box.

Concerning the link to the wiki: when someone asked me for a link after we 
published the new main web page, I had forgotten that this is already available 
on our main web site (check the entry in the General tab). Let me know 
whether this link should be someplace more prominent (keep in mind, though, that 
given we have Apache CMS now there probably won't be a whole lot of 
content left to put into the wiki anyway).

 Rework the wiki
 ---

 Key: MAHOUT-1305
 URL: https://issues.apache.org/jira/browse/MAHOUT-1305
 Project: Mahout
  Issue Type: Bug
  Components: Documentation
Reporter: Sebastian Schelter
Priority: Blocker
 Fix For: 0.9

 Attachments: MAHOUT-221213-1315-15716.pdf


 We should think about completely redoing our wiki. At the moment, we're 
 listing lots of algorithms that we either never implemented or have already 
 removed. I also have the impression that a lot of the content is outdated.
 It would be awesome if we had up-to-date documentation of the code with 
 instructions on how to get started using Mahout quickly.
 We should also have examples for all our 3 C's.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Streaming KMeans clustering

2013-12-23 Thread Suneel Marthi
Has anyone been successful running Streaming KMeans clustering on a large 
dataset (> 100,000 points)?


It just seems to take a very long time (> 4 hrs) for the mappers to finish on 
about 300K data points, and the reduce phase has only a single reducer running, 
which throws an OOM that fails the job several hours after it has been kicked 
off.

It's the same story when trying to run in sequential mode.

Looking at the code, the bottleneck seems to be in 
StreamingKMeans.clusterInternal(); without understanding the behaviour of the 
algorithm, I am not sure if the sequence of steps in there is correct.


There are a few calls that are invoked repeatedly, like 
StreamingKMeans.clusterInternal() and Searcher.searchFirst().

We really need to have this working on datasets that are larger than the 20K 
Reuters dataset.

I am trying to run this on 300K vectors with k = 100, km = 1261 and 
FastProjectSearch.


Re: Streaming KMeans clustering

2013-12-23 Thread Sebastian Schelter
That the algorithm runs a single reducer is expected. The algorithm
creates a sketch of the data in parallel in the map-phase, which is
collected by the reducer afterwards. The reducer then applies an
expensive in-memory clustering algorithm to the sketch.
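The map-side sketch / reduce-side clustering split described above can be illustrated with a toy example in Java. All names here (and the greedy 1-D sketching rule) are illustrative only, not Mahout's actual StreamingKMeans API:

```java
import java.util.ArrayList;
import java.util.List;

public class SketchThenCluster {
  // A weighted 1-D point standing in for a Mahout Centroid.
  static final class WeightedPoint {
    final double value;
    double weight;
    WeightedPoint(double value, double weight) { this.value = value; this.weight = weight; }
  }

  // "Map" phase: greedily fold each point into the nearest sketch centroid
  // if it lies within the cutoff, otherwise start a new centroid.
  static List<WeightedPoint> sketch(double[] partition, double cutoff) {
    List<WeightedPoint> centroids = new ArrayList<>();
    for (double p : partition) {
      WeightedPoint nearest = null;
      double best = Double.MAX_VALUE;
      for (WeightedPoint c : centroids) {
        double d = Math.abs(c.value - p);
        if (d < best) { best = d; nearest = c; }
      }
      if (nearest != null && best <= cutoff) {
        nearest.weight += 1.0; // absorb the point into an existing centroid
      } else {
        centroids.add(new WeightedPoint(p, 1.0)); // open a new centroid
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    // Two "mapper" partitions drawn from clusters around 0 and 10.
    List<WeightedPoint> sketchA = sketch(new double[] {0.1, 0.2, 9.9}, 1.0);
    List<WeightedPoint> sketchB = sketch(new double[] {10.1, 0.0, 10.2}, 1.0);
    // "Reduce" phase input: the union of the sketches is far smaller than
    // the raw input, so an expensive in-memory algorithm (BallKMeans in
    // Mahout) can afford to run on it in a single reducer.
    List<WeightedPoint> union = new ArrayList<>(sketchA);
    union.addAll(sketchB);
    System.out.println(union.size()); // 4 sketch centroids versus 6 raw points
  }
}
```

The point of the sketch is exactly this size reduction: the single reducer never sees the raw 300K vectors, only the weighted centroids emitted by the mappers.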

Which dataset are you using for testing? I can also do some tests on a
cluster here.

I can imagine two possible causes for the problems: maybe there's a
problem with the vectors, and some calculations take very long because
the wrong access pattern or implementation is chosen.

Another cause could be that the mappers and reducers have too little
memory and spend a lot of time running garbage collections.

--sebastian


On 23.12.2013 22:14, Suneel Marthi wrote:
 Has anyone been successful running Streaming KMeans clustering on a large 
 dataset (> 100,000 points)?
 
 
 It just seems to take a very long time (> 4 hrs) for the mappers to finish on 
 about 300K data points, and the reduce phase has only a single reducer running, 
 which throws an OOM that fails the job several hours after it has been kicked 
 off.
 
 It's the same story when trying to run in sequential mode.
 
 Looking at the code, the bottleneck seems to be in 
 StreamingKMeans.clusterInternal(); without understanding the behaviour of the 
 algorithm, I am not sure if the sequence of steps in there is correct.
 
 
 There are a few calls that are invoked repeatedly, like 
 StreamingKMeans.clusterInternal() and Searcher.searchFirst().
 
 We really need to have this working on datasets that are larger than the 20K 
 Reuters dataset.
 
 I am trying to run this on 300K vectors with k = 100, km = 1261 and 
 FastProjectSearch.
 



[jira] [Commented] (MAHOUT-1358) StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true

2013-12-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855974#comment-13855974
 ] 

Hudson commented on MAHOUT-1358:


SUCCESS: Integrated in Mahout-Quality #2381 (See 
[https://builds.apache.org/job/Mahout-Quality/2381/])
MAHOUT-1358 - earlier fix for this issue throws a heap space exception for 
large datasets during the Mapper phase, new fix in place now and code cleanup. 
(smarthi: rev 1553189)
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansThread.java


 StreamingKMeansThread throws IllegalArgumentException when 
 REDUCE_STREAMING_KMEANS is set to true
 -

 Key: MAHOUT-1358
 URL: https://issues.apache.org/jira/browse/MAHOUT-1358
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 0.9

 Attachments: MAHOUT-1358.patch


 Running StreamingKMeans Clustering with REDUCE_STREAMING_KMEANS = true, 
 throws the following error
 {Code}
 java.lang.IllegalArgumentException: Must have nonzero number of training and 
 test vectors. Asked for %.1f %% of %d vectors for test [10.00149011612, 0]
   at 
 com.google.common.base.Preconditions.checkArgument(Preconditions.java:120)
   at 
 org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
   at 
 org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
   at 
 org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
   at 
 org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
   at 
 org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
   at 
 org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
 {Code}
 The issue is caused by the following code in StreamingKMeansThread.call()
 {Code}
 Iterator<Centroid> datapointsIterator = datapoints.iterator();
 if (estimateDistanceCutoff == StreamingKMeansDriver.INVALID_DISTANCE_CUTOFF) {
   List<Centroid> estimatePoints = Lists.newArrayListWithExpectedSize(NUM_ESTIMATE_POINTS);
   while (datapointsIterator.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
     estimatePoints.add(datapointsIterator.next());
   }
   estimateDistanceCutoff = ClusteringUtils.estimateDistanceCutoff(estimatePoints,
       searcher.getDistanceMeasure());
 }
 StreamingKMeans clusterer = new StreamingKMeans(searcher, numClusters,
     estimateDistanceCutoff);
 while (datapointsIterator.hasNext()) {
   clusterer.cluster(datapointsIterator.next());
 }
 {Code}
 The code is using the same iterator twice, and it fails on the second use 
 because the first loop may already have exhausted it.
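One way to avoid the double consumption (a sketch of the general fix pattern, not necessarily the patch that was committed) is to buffer the sampled estimate points and replay them before draining the rest of the single-pass iterator:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class IteratorReuse {
  static final int NUM_ESTIMATE_POINTS = 3;

  // Consumes up to NUM_ESTIMATE_POINTS elements for estimation, then
  // replays the buffered elements so every point still reaches clustering.
  static List<Integer> clusterAll(Iterable<Integer> datapoints) {
    Iterator<Integer> it = datapoints.iterator();
    List<Integer> estimatePoints = new ArrayList<>();
    while (it.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
      estimatePoints.add(it.next());
    }
    // ... a distance cutoff would be estimated from estimatePoints here ...
    List<Integer> clustered = new ArrayList<>(estimatePoints); // replay the buffer
    while (it.hasNext()) {
      clustered.add(it.next()); // then drain the remainder of the iterator
    }
    return clustered;
  }

  public static void main(String[] args) {
    System.out.println(clusterAll(List.of(1, 2, 3, 4, 5)));
  }
}
```

With this shape, the second loop sees the estimate points again instead of an already-exhausted iterator, so no datapoint is silently dropped.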



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Jenkins build is back to normal : Mahout-Quality #2381

2013-12-23 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Quality/2381/changes



Re: Mahout 0.9 Release Notes - First Draft

2013-12-23 Thread Dmitriy Lyubimov
On Sat, Dec 21, 2013 at 6:28 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Hi All,

 Please see below the first draft of the release notes for Mahout 0.9. Please
 feel free to add/edit sections as you see fit.
 (This is a draft only.)

 Regards,
 Suneel


 -


 The Apache Mahout PMC is pleased to announce the release of Mahout 0.9.
 Mahout's goal is to build scalable machine learning libraries focused
 primarily on the areas of collaborative filtering (recommenders),
 clustering and classification (known collectively as the 3Cs), as well
 as the necessary infrastructure to support those implementations. This
 includes, but is not limited to, math packages for statistics, linear
 algebra and more, Java primitive collections, local and distributed
 vector and matrix classes, and a variety of integrative code to work
 with popular packages like Apache Hadoop, Apache Lucene, Apache HBase
 and Apache Cassandra. The 0.9 release is mainly a clean-up release in
 preparation for an upcoming 1.0 release targeted for the first half of
 2014, but there are a few significant new features, which are
 highlighted below.

 To get started with Apache Mahout 0.9, download the release artifacts and
 signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the
 central Maven repository.

 In addition to the release highlights and artifacts, please pay attention
 to the section labelled FUTURE PLANS below for more information about
 upcoming releases of Mahout.

 As with any release, we wish to thank all of the users and contributors
 to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for
 individual credits, as there are too many to list here.

 GETTING STARTED

 In the release package, the examples directory contains several working
 examples of the core functionality available in Mahout. These can be run
 via scripts in the examples/bin directory and will prompt you for more
 information to help you try things out. Most examples do not need a
 Hadoop cluster in order to run.

 RELEASE HIGHLIGHTS

 The highlights of the Apache Mahout 0.9 release include, but are not
 limited to, the items listed below. For further information, see the
 included CHANGELOG file.

 - Scala DSL Bindings for Mahout Math Linear Algebra (MAHOUT-1297).
See
 http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
 - New Multilayer Perceptron Classifier (MAHOUT-1265)
 - Recommenders as a Search (MAHOUT-1288).  See
 https://github.com/pferrel/solr-recommender
 - MAHOUT-1364: Upgrade Mahout to be Lucene 4.6.0 compliant
 - MAHOUT-1361: Online Algorithm for computing accurate Quantiles using
   1-dimensional Clustering. See
   https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
   for the details.

 - Removed Deprecated algorithms.

 - The usual bug fixes. See JIRA [?] for more information on the 0.9
 release.


 A total of 91 separate JIRA issues were addressed in this release.

 The following algorithms that were marked deprecated in 0.8 have been
 removed in 0.9:

 - From Clustering:
   Dirichlet - replaced by Collapsible Variational Bayes (CVB)


I think the name of the method i commonly hear is Collapsed Variational
Bayes


   Meanshift

   MinHash - removed due to poor performance and lack of usage

   EigenCuts -


 - From Classification (both are sequential implementations)

   Winnow - lack of actual usage

   Perceptron - lack of actual usage


 - Frequent Pattern Mining

 - Collaborative Filtering
 All recommenders in org.apache.mahout.cf.taste.impl.recommender.knn
 SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone
 and org.apache.mahout.cf.taste.impl.recommender.slopeone
 Distributed pseudo recommender in
 org.apache.mahout.cf.taste.hadoop.pseudo
 TreeClusteringRecommender in
 org.apache.mahout.cf.taste.impl.recommender

 - Mahout Math
 Lanczos in favour of SSVD
 Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

 If you are interested in supporting 1 or more of these algorithms, please
 make it known on dev@mahout.apache.org and via JIRA issues that fix
 and/or improve them. Please also provide
 supporting evidence as to their effectiveness for you in production.


 CONTRIBUTING

 Mahout
  is always looking for contributions focused on the 3Cs. If you are
 interested in contributing, please see our contribution page,
 https://cwiki.apache.org/MAHOUT/how-to-contribute.html, on the Mahout
 wiki or contact us via email at dev@mahout.apache.org.

 FUTURE PLANS

 1.0 Plans
 


 - New Downpour SGD classifier

 - Support for Finite State Transducers (FST) as a Dictionary Type.
 - Support for Hadoop 2.x
 - Port Mahout's recommenders to Spark (??)
 - Support for Java 7
 - Better API interfaces for Clustering
 - (what else???)


 As the project moves towards a 1.0 release, the community will be focused
 on
 key algorithms that are proven to scale in 

Re: Mahout 0.9 Release Notes - First Draft

2013-12-23 Thread Dmitriy Lyubimov
On Sun, Dec 22, 2013 at 11:21 AM, Sebastian Schelter 
ssc.o...@googlemail.com wrote:


 
  - Mahout Math
  Lanczos in favour of SSVD

 IIRC, we agreed to not remove Lanczos, although it was initially
 deprecated. We should undeprecate it.


Some folks like Lanczos in Mahout (for reasons not really clear to me;
aside from accuracy when computing the SVD of random noise, there are
actually zero reasons to use Lanczos instead). I agree we don't necessarily
want to cull it out -- but IMO there should be a clear steer posted in
favor of SSVD in the docs/javadocs.


Re: Mahout 0.9 Release Notes - First Draft

2013-12-23 Thread Andrew Musselman
Suneel ran into some issues this weekend; I'm going to try it out and see if I 
can repro.

 On Dec 23, 2013, at 1:02 AM, Isabel Drost-Fromm isa...@apache.org wrote:
 
 Hi,
 
 one thing I forgot: you once mentioned running into issues with the new 
 kmeans - are those fixed or tracked in JIRA? In case of the latter, we should 
 include a known issues / call for helping hands section.
 
 Isabel


[jira] [Created] (MAHOUT-1388) Add command line support and logging for MLP

2013-12-23 Thread Yexi Jiang (JIRA)
Yexi Jiang created MAHOUT-1388:
--

 Summary: Add command line support and logging for MLP
 Key: MAHOUT-1388
 URL: https://issues.apache.org/jira/browse/MAHOUT-1388
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 1.0
Reporter: Yexi Jiang
 Fix For: 1.0


The user should have the ability to run the Perceptron from the command line.

There are two modes for the MLP: training and labeling. The first takes the 
data as input and outputs the model; the second takes the model and unlabeled 
data as input and outputs the results.

The parameters are as follows:

--mode -mo // train or label
--input -i (input data)
--model -mo  // in training mode, this is the location to store the model (if 
the specified location has an existing model, it will update the model through 
incremental learning), in labeling mode, this is the location to store the 
result
--output -o   // this is only useful in labeling mode
--layersize -ls (no. of units per hidden layer) // use comma separated number 
to indicate the number of neurons in each layer (including input layer and 
output layer)
--momentum -m 
--learningrate -l
--regularizationweight -r
--costfunction -cf // the type of cost function

For example, to train a 3-layer (input, hidden, and output) MLP with the 
minus_squared cost function, a 0.1 learning rate, 0.1 momentum rate, and 0.01 
regularization weight, the command would be:

mlp -mo train -i /tmp/training-data.csv -o /tmp/model.model -ls 5,3,1 -l 0.1 -m 
0.1 -r 0.01 -cf minus_squared

This command would read the training data from /tmp/training-data.csv and write 
the trained model to /tmp/model.model.

If a user needs to use an existing model, they can use the following command:
mlp -mo label -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result

Moreover, we should provide default values if the user does not specify any. 
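Applying such defaults could look like the following minimal sketch. The flag names follow the proposal above, but the default values and the parse helper are hypothetical, not the eventual Mahout CLI:

```java
import java.util.HashMap;
import java.util.Map;

public class MlpArgs {
  // Parses "--flag value" pairs into a map that is pre-filled with
  // defaults, so any flag the user omits falls back to a sensible value.
  static Map<String, String> parse(String[] args) {
    Map<String, String> opts = new HashMap<>();
    opts.put("--learningrate", "0.5");         // hypothetical default
    opts.put("--momentum", "0.1");             // hypothetical default
    opts.put("--regularizationweight", "0.0"); // hypothetical default
    opts.put("--costfunction", "minus_squared");
    for (int i = 0; i + 1 < args.length; i += 2) {
      opts.put(args[i], args[i + 1]);          // user-supplied value wins
    }
    return opts;
  }

  public static void main(String[] args) {
    Map<String, String> opts = parse(new String[] {"--learningrate", "0.1"});
    // The explicit flag overrides its default; the omitted ones keep theirs.
    System.out.println(opts.get("--learningrate") + " " + opts.get("--momentum"));
  }
}
```

Whatever the real parser ends up being (e.g. Mahout's existing option-parsing infrastructure), the principle is the same: seed the option table with defaults first, then overwrite with whatever the user passed.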



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)