http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/docs/0.13.0/tutorials/play-with-shell.md ---------------------------------------------------------------------- diff --git a/website/_pages/docs/0.13.0/tutorials/play-with-shell.md b/website/_pages/docs/0.13.0/tutorials/play-with-shell.md new file mode 100644 index 0000000..0c88839 --- /dev/null +++ b/website/_pages/docs/0.13.0/tutorials/play-with-shell.md @@ -0,0 +1,198 @@ +--- +layout: mahoutdoc +title: Mahout Samsara In Core +permalink: /docs/0.13.0/tutorials/samsara-spark-shell +--- +# Playing with Mahout's Spark Shell + +This tutorial will show you how to play with Mahout's scala DSL for linear algebra and its Spark shell. **Please keep in mind that this code is still in a very early experimental stage**. + +_(Edited for 0.10.2)_ + +## Intro + +We'll use an excerpt of a publicly available [dataset about cereals](http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html). The dataset tells the protein, fat, carbohydrate and sugars (in milligrams) contained in a set of cereals, as well as a customer rating for the cereals. Our aim for this example is to fit a linear model which infers the customer rating from the ingredients. + + +Name | protein | fat | carbo | sugars | rating +:-----------------------|:--------|:----|:------|:-------|:--------- +Apple Cinnamon Cheerios | 2 | 2 | 10.5 | 10 | 29.509541 +Cap'n'Crunch | 1 | 2 | 12 | 12 | 18.042851 +Cocoa Puffs | 1 | 1 | 12 | 13 | 22.736446 +Froot Loops | 2 | 1 | 11 | 13 | 32.207582 +Honey Graham Ohs | 1 | 2 | 12 | 11 | 21.871292 +Wheaties Honey Gold | 2 | 1 | 16 | 8 | 36.187559 +Cheerios | 6 | 2 | 17 | 1 | 50.764999 +Clusters | 3 | 2 | 13 | 7 | 40.400208 +Great Grains Pecan | 3 | 3 | 13 | 4 | 45.811716 + + +## Installing Mahout & Spark on your local machine + +We describe how to do a quick toy setup of Spark & Mahout on your local machine, so that you can run this example and play with the shell. + + 1. 
Download [Apache Spark 1.6.2](http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz) and unpack the archive file + 1. Change to the directory where you unpacked Spark and type ```sbt/sbt assembly``` to build it + 1. Create a directory for Mahout somewhere on your machine, change into it and check out the master branch of Apache Mahout from GitHub ```git clone https://github.com/apache/mahout mahout``` + 1. Change to the ```mahout``` directory and build Mahout using ```mvn -DskipTests clean install``` + +## Starting Mahout's Spark shell + + 1. Go to the directory where you unpacked Spark and type ```sbin/start-all.sh``` to locally start Spark + 1. Open a browser, point it to [http://localhost:8080/](http://localhost:8080/) to check whether Spark successfully started. Copy the URL of the Spark master at the top of the page (it starts with **spark://**) + 1. Define the following environment variables: <pre class="codehilite">export MAHOUT_HOME=[directory into which you checked out Mahout] +export SPARK_HOME=[directory where you unpacked Spark] +export MASTER=[url of the Spark master] +</pre> + 1. Finally, change to the directory where you unpacked Mahout and type ```bin/mahout spark-shell```. +You should see the shell starting and get the prompt ```mahout> ```. Check the +[FAQ](http://mahout.apache.org/users/sparkbindings/faq.html) for further troubleshooting. + +## Implementation + +We'll use the shell to interactively play with the data and incrementally implement a simple [linear regression](https://en.wikipedia.org/wiki/Linear_regression) algorithm. Let's first load the dataset. Usually, we wouldn't need Mahout unless we processed a large dataset stored in a distributed filesystem. But for the sake of this example, we'll use our tiny toy dataset and "pretend" it was too big to fit onto a single machine. 
+ +*Note: You can incrementally follow the example by copy-and-pasting the code into your running Mahout shell.* + +Mahout's linear algebra DSL has an abstraction called *DistributedRowMatrix (DRM)* which models a matrix that is partitioned by rows and stored in the memory of a cluster of machines. We use ```dense()``` to create a dense in-memory matrix from our toy dataset and use ```drmParallelize``` to load it into the cluster, "mimicking" a large, partitioned dataset. + +<div class="codehilite"><pre> +val drmData = drmParallelize(dense( + (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios + (1, 2, 12, 12, 18.042851), // Cap'n'Crunch + (1, 1, 12, 13, 22.736446), // Cocoa Puffs + (2, 1, 11, 13, 32.207582), // Froot Loops + (1, 2, 12, 11, 21.871292), // Honey Graham Ohs + (2, 1, 16, 8, 36.187559), // Wheaties Honey Gold + (6, 2, 17, 1, 50.764999), // Cheerios + (3, 2, 13, 7, 40.400208), // Clusters + (3, 3, 13, 4, 45.811716)), // Great Grains Pecan + numPartitions = 2); +</pre></div> + +Have a look at this matrix. The first four columns represent the ingredients +(our features) and the last column (the rating) is the target variable for +our regression. [Linear regression](https://en.wikipedia.org/wiki/Linear_regression) +assumes that the **target variable** `\(\mathbf{y}\)` is generated by the +linear combination of **the feature matrix** `\(\mathbf{X}\)` with the +**parameter vector** `\(\boldsymbol{\beta}\)` plus the + **noise** `\(\boldsymbol{\varepsilon}\)`, summarized in the formula +`\(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\)`. +Our goal is to find an estimate of the parameter vector +`\(\boldsymbol{\beta}\)` that explains the data very well. + +As a first step, we extract `\(\mathbf{X}\)` and `\(\mathbf{y}\)` from our data matrix. We get *X* by slicing: we take all rows (denoted by ```::```) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. 
The shell will not execute this code yet; it saves the history of operations and defers the execution until we really access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.** + +<div class="codehilite"><pre> +val drmX = drmData(::, 0 until 4) +</pre></div> + +Next, we extract the target variable vector *y*, the fifth column of the data matrix. We assume this one fits into our driver machine, so we fetch it into memory using ```collect```: + +<div class="codehilite"><pre> +val y = drmData.collect(::, 4) +</pre></div> + +Now we are ready to think about a mathematical way to estimate the parameter vector *β*. A simple textbook approach is [ordinary least squares (OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares), which minimizes the sum of squared residuals between the true target variable and the prediction of the target variable. In OLS, there is even a closed form expression for estimating `\(\boldsymbol{\beta}\)` as +`\(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)`. + +The first thing which we compute for this is `\(\mathbf{X}^{\top}\mathbf{X}\)`. The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula. The operation ```.t()``` transposes a matrix and, analogous to R, ```%*%``` denotes matrix multiplication. + +<div class="codehilite"><pre> +val drmXtX = drmX.t %*% drmX +</pre></div> + +The same is true for computing `\(\mathbf{X}^{\top}\mathbf{y}\)`. We can simply type the math in Scala expressions into the shell. Here, *X* lives in the cluster, while *y* is in the memory of the driver, and the result is a DRM again. +<div class="codehilite"><pre> +val drmXty = drmX.t %*% y +</pre></div> + +We're nearly done. 
The next step we take is to fetch `\(\mathbf{X}^{\top}\mathbf{X}\)` and +`\(\mathbf{X}^{\top}\mathbf{y}\)` into the memory of our driver machine (we are targeting +feature matrices that are tall and skinny, +so we can assume that `\(\mathbf{X}^{\top}\mathbf{X}\)` is small enough +to fit in). Then, we provide them to an in-memory solver (Mahout provides +an analog to R's ```solve()``` for that) which computes ```beta```, our +OLS estimate of the parameter vector `\(\boldsymbol{\beta}\)`. + +<div class="codehilite"><pre> +val XtX = drmXtX.collect +val Xty = drmXty.collect(::, 0) + +val beta = solve(XtX, Xty) +</pre></div> + +That's it! We have implemented a distributed linear regression algorithm +on Apache Spark. I hope you agree that we didn't have to worry a lot about +parallelization and distributed systems. The goal of Mahout's linear algebra +DSL is to abstract away the ugliness of programming a distributed system +as much as possible, while still retaining decent performance and +scalability. + +We can now check how well our model fits its training data. +First, we multiply the feature matrix `\(\mathbf{X}\)` by our estimate of +`\(\boldsymbol{\beta}\)`. Then, we look at the difference (via the L2 norm) between +the target variable `\(\mathbf{y}\)` and the fitted target variable: + +<div class="codehilite"><pre> +val yFitted = (drmX %*% beta).collect(::, 0) +(y - yFitted).norm(2) +</pre></div> + +We hope that we could show that Mahout's shell allows people to interactively and incrementally write algorithms. We have entered a lot of individual commands, one by one, until we got the desired results. We can now refactor a little by wrapping our statements into easy-to-use functions. The definition of functions follows standard Scala syntax. + +We put all the commands for ordinary least squares into a function ```ols```. 
+ +<div class="codehilite"><pre> +def ols(drmX: DrmLike[Int], y: Vector) = + solve(drmX.t %*% drmX, drmX.t %*% y)(::, 0) + +</pre></div> + +Note that the DSL applies an implicit `collect` if coercion rules require an in-core argument. Hence, we can simply +skip explicit `collect`s. + +Next, we define a function ```goodnessOfFit``` that tells how well a model fits the target variable: + +<div class="codehilite"><pre> +def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = { + val fittedY = (drmX %*% beta).collect(::, 0) + (y - fittedY).norm(2) +} +</pre></div> + +So far we have left out an important aspect of a standard linear regression +model. Usually there is a constant bias term added to the model. Without +that, our model always passes through the origin and we can only learn its +slope. An easy way to add such a bias term to our model is to add a +column of ones to the feature matrix `\(\mathbf{X}\)`. +The corresponding weight in the parameter vector will then be the bias term. + +Here is how we add a bias column: + +<div class="codehilite"><pre> +val drmXwithBiasColumn = drmX cbind 1 +</pre></div> + +Now we can give the newly created DRM ```drmXwithBiasColumn``` to our model fitting method ```ols``` and see how well the resulting model fits the training data with ```goodnessOfFit```. You should see a large improvement in the result. + +<div class="codehilite"><pre> +val betaWithBiasTerm = ols(drmXwithBiasColumn, y) +goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y) +</pre></div> + +As a further optimization, we can make use of the DSL's caching functionality. We use ```drmXwithBiasColumn``` repeatedly as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling ```checkpoint()```. 
In the end, we remove it from the cache with ```uncache()```: + +<div class="codehilite"><pre> +val cachedDrmX = drmXwithBiasColumn.checkpoint() + +val betaWithBiasTerm = ols(cachedDrmX, y) +val goodness = goodnessOfFit(cachedDrmX, betaWithBiasTerm, y) + +cachedDrmX.uncache() + +goodness +</pre></div> + + +Liked what you saw? Check out Mahout's overview for the [Scala and Spark bindings](https://mahout.apache.org/users/sparkbindings/home.html). \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/downloads.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/downloads.mdtext b/website/_pages/downloads.mdtext new file mode 100644 index 0000000..047d658 --- /dev/null +++ b/website/_pages/downloads.mdtext @@ -0,0 +1,63 @@ +Title: Downloads + +<a name="Downloads-OfficialRelease"></a> +# Official Release +Apache Mahout is an official Apache project and thus available from any of +the Apache mirrors. The latest Mahout release is available for download at: + +* [Download Latest](http://www.apache.org/dyn/closer.cgi/mahout/) +* [Release Archive](http://archive.apache.org/dist/mahout/) + + +# Source code for the current snapshot + +Apache Mahout is mirrored to [Github](https://github.com/apache/mahout). To get all source: + + git clone https://github.com/apache/mahout.git mahout + +# Environment + +Whether you are using Mahout's Shell, running command line jobs or using it as a library to build your own apps +you'll need to setup several environment variables. +Edit your environment in ```~/.bash_profile``` for Mac or ```~/.bashrc``` for many linux distributions. Add the following + + export MAHOUT_HOME=/path/to/mahout + export MAHOUT_LOCAL=true # for running standalone on your dev machine, + # unset MAHOUT_LOCAL for running on a cluster + +If you are running on Spark you will also need $SPARK_HOME + +Make sure to have $JAVA_HOME set also + +# Using Mahout as a Library + +Running any application that uses Mahout will require installing a binary or source version and setting the environment. +Then add the appropriate setting to your pom.xml or build.sbt following the template below. 
+ +If you only need the math part of Mahout: + + <dependency> + <groupId>org.apache.mahout</groupId> + <artifactId>mahout-math</artifactId> + <version>${mahout.version}</version> + </dependency> + +In case you would like to use some of our integration tooling (e.g. for generating vectors from Lucene): + + <dependency> + <groupId>org.apache.mahout</groupId> + <artifactId>mahout-hdfs</artifactId> + <version>${mahout.version}</version> + </dependency> + +In case you are using Ivy, Gradle, Buildr, Grape or SBT you might want to directly head over to the official [Maven Repository search](http://mvnrepository.com/artifact/org.apache.mahout/mahout-core). + + +<a name="Downloads-FutureReleases"></a> +# Future Releases + +Official releases are usually created when the developers feel there are +sufficient changes, improvements and bug fixes to warrant a release. Watch +the <a href="https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html">Mailing lists</a> + for latest release discussions and check the Github repo. + http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/dustin.html ---------------------------------------------------------------------- diff --git a/website/_pages/dustin.html b/website/_pages/dustin.html new file mode 100644 index 0000000..de57b01 --- /dev/null +++ b/website/_pages/dustin.html @@ -0,0 +1,20 @@ +--- +layout: mahout +title: Dustins Test +permalink: /dustin/ +--- + +It doesn't matter what comes, fresh goes better in life, with Mentos fresh and full of Life! Nothing gets to you, stayin' fresh, stayin' cool, with Mentos fresh and full of life! Fresh goes better! Mentos freshness! Fresh goes better with Mentos, fresh and full of life! Mentos! The Freshmaker! + +We got a right to pick a little fight, Bonanza! If anyone fights anyone of us, he's gotta fight with me! We're not a one to saddle up and run, Bonanza! Anyone of us who starts a little fuss knows he can count on me! 
One for four, four for one, this we guarantee. We got a right to pick a little fight, Bonanza! If anyone fights anyone of us he's gotta fight with me! + + <div class="col-md-12"> + {% for post in paginator.posts %} + {% include tile.html %} + {% endfor %} + + + {% include pagination.html %} + </div> + + http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/faq.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/faq.mdtext b/website/_pages/faq.mdtext new file mode 100644 index 0000000..15a7e23 --- /dev/null +++ b/website/_pages/faq.mdtext @@ -0,0 +1,100 @@ +Title: FAQ + +# The Official Mahout FAQ + +*General* + +1. [What is Apache Mahout?](#whatis) +1. [What does the name mean?](#mean) +1. [How is the name pronounced?](#pronounce) +1. [Where can I find the origins of the Mahout project?](#historical) +1. [Where can I download the Mahout logo?](#downloadlogo) +1. [Where can I download Mahout slide presentations?](#presentations) + +*Algorithms* + +1. [What algorithms are implemented in Mahout?](#algos) +1. [What algorithms are missing from Mahout?](#todo) +1. [Do I need Hadoop to run Mahout?](#hadoop) + +*Hadoop specific questions* + +1. [Mahout just won't run in parallel on my dataset. Why?](#split) + + +# *Answers* + + +## General + + +<a name="whatis"></a> +#### What is Apache Mahout? + +Apache Mahout is a suite of machine learning libraries designed to be +scalable and robust. + +<a name="mean"></a> +#### What does the name mean? + +The name [Mahout](http://en.wikipedia.org/wiki/Mahout) + was originally chosen for its association with the [Apache Hadoop](http://hadoop.apache.org) + project. A mahout is a person who drives an elephant (hint: Hadoop's logo +is an elephant). We just wanted a name that complemented Hadoop but we see +our project as a good driver of Hadoop in the sense that we will be using +and testing it. 
We are not, however, implying that we are controlling +Hadoop's development. + +Prior to coming to the ASF, those of us working on the project plan voted between [Howdah](http://en.wikipedia.org/wiki/Howdah) (the carriage on top of an elephant) and Mahout. + +<a name="historical"></a> +#### Where can I find the origins of the Mahout project? + +See [http://ml-site.grantingersoll.com](http://web.archive.org/web/20080101233917/http://ml-site.grantingersoll.com/index.php?title=Main_Page) + for old wiki and mailing list archives (all read-only) + +Mahout was started by <a href="http://web.archive.org/web/20071228055210/http://ml-site.grantingersoll.com/index.php?title=Main_Page" class="external-link" rel="nofollow">Isabel Drost, Grant Ingersoll and Karl Wettin</a>. It <a href="http://web.archive.org/web/20080201093120/http://lucene.apache.org/#22+January+2008+-+Lucene+PMC+Approves+Mahout+Machine+Learning+Project" class="external-link" rel="nofollow">started</a> as part of the <a href="http://lucene.apache.org" class="external-link" rel="nofollow">Lucene</a> project (see the <a href="http://web.archive.org/web/20080102151102/http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal" class="external-link" rel="nofollow">original proposal</a>) and went on to become a top level project in April of 2010.</p><p style="text-align: left;">The original goal was to implement all 10 algorithms from Andrew Ng's paper "<a href="http://ai.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf" class="external-link" rel="nofollow">Map-Reduce for Machine Learning on Multicore</a>"</p> + +<a name="pronounce"></a> +#### How is the name pronounced? + +There are some disagreements about how to pronounce the name. Webster's has it as muh-hout (as in ["out"](http://dictionary.reference.com/browse/mahout)), but the Sanskrit/Hindi origins pronounce it as "muh-hoot". The second pronunciation suggests a nice pun on the Hebrew word מהות meaning "essence or truth". 
+ +<a name="downloadlogo"></a> +#### Where can I download the Mahout logo? + +See [MAHOUT-335](https://issues.apache.org/jira/browse/MAHOUT-335) + + +<a name="presentations"></a> +#### Where can I download Mahout slide presentations? + +The [Books, Tutorials and Talks](https://mahout.apache.org/general/books-tutorials-and-talks.html) + page contains an overview of a wide variety of presentations with links to slides where available. + +## Algorithms + +<a name="algos"></a> +#### What algorithms are implemented in Mahout? + +We are interested in a wide variety of machine learning algorithms, many of +which are already implemented in Mahout. You can find a list [here](https://mahout.apache.org/users/basics/algorithms.html). + +<a name="todo"></a> +#### What algorithms are missing from Mahout? + +There are many machine learning algorithms that we would like to have in +Mahout. If you have an algorithm or an improvement to an algorithm that you would +like to implement, start a discussion on our [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html). + +<a name="hadoop"></a> +#### Do I need Hadoop to run Mahout? + +There are a number of algorithm implementations that require no Hadoop dependencies whatsoever; consult the [algorithms list](https://mahout.apache.org/users/basics/algorithms.html). In the future, we might provide more algorithm implementations on platforms more suitable for machine learning such as [Apache Spark](http://spark.apache.org). + +## Hadoop specific questions +<a name="split"></a> +#### Mahout just won't run in parallel on my dataset. Why? + +If you are running training on a Hadoop cluster keep in mind that the number of mappers started is governed by the size of the input data and the configured split/block size of your cluster. As a rule of thumb, +anything below 100MB in size won't be split by default. 
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/github.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/github.mdtext b/website/_pages/github.mdtext new file mode 100644 index 0000000..f28f01d --- /dev/null +++ b/website/_pages/github.mdtext @@ -0,0 +1,168 @@ +Title: +Notice: Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + . + http://www.apache.org/licenses/LICENSE-2.0 + . + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +# Github Setup and Pull Requests (PRs) # + +There are several ways to setup Git for committers and contributors. Contributors can safely setup +Git any way they choose but committers should take extra care since they can push new commits to the master at +Apache and various policies there make backing out mistakes problematic. Therefore all but very small changes should +go through a PR, even for committers. To keep the commit history clean take note of the use of --squash below +when merging into apache/master. + +##Git setup for Committers + +This describes setup for one local repo and two remotes. It allows you to push the code on your machine to either your Github repo or to git-wip-us.apache.org. 
+You will want to fork github's apache/mahout to your own account on github, this will enable Pull Requests of your own. +Cloning this fork locally will set up "origin" to point to your remote fork on github as the default remote. +So if you perform "git push origin master" it will go to github. + +To attach to the apache git repo do the following: + + git remote add apache https://git-wip-us.apache.org/repos/asf/mahout.git + +To check your remote setup + + git remote -v + +you should see something like this: + + origin https://github.com/your-github-id/mahout.git (fetch) + origin https://github.com/your-github-id/mahout.git (push) + apache https://git-wip-us.apache.org/repos/asf/mahout.git (fetch) + apache https://git-wip-us.apache.org/repos/asf/mahout.git (push) + +Now if you want to experiment with a branch everything, by default, points to your github account because 'origin' is default. You can work as normal using only github until you are ready to merge with the apache remote. Some conventions will integrate with Apache Jira ticket numbers. + + git checkout -b mahout-xxxx #xxxx typically is a Jira ticket number + #do some work on the branch + git commit -a -m "doing some work" + git push origin mahout-xxxx # notice pushing to **origin** not **apache** + +Once you are ready to commit to the apache remote you can merge and push them directly or better yet create a PR. + +##How to create a PR (committers) + +Push your branch to Github: + + git checkout mahout-xxxx + git push origin mahout-xxxx + +Go to your mahout-xxxx branch on Github. Since you forked it from Github's apache/mahout it will default +any PR to go to apache/master. + +* Click the green "Compare, review, and create pull request" button. +* You can edit the to and from for the PR if it isn't correct. The "base fork" should be apache/mahout unless you are collaborating +separately with one of the committers on the list. The "base" will be master. 
Don't submit a PR to one of the other +branches unless you know what you are doing. The "head fork" will be your forked repo and the "compare" will be +your mahout-xxxx branch. +* Click the "Create pull request" button and name the request "MAHOUT-XXXX" all caps. +This will connect the comments of the PR to the mailing list and Jira comments. +* From now on the PR lives on github's apache/mahout. You use the commenting UI there. +* If you are looking for a review or sharing with someone else say so in the comments but don't worry about +automated merging of your PR--you will have to do that later. The PR is tied to your branch so you can respond to +comments, make fixes, and commit them from your local repo. They will appear on the PR page and be mirrored to Jira +and the mailing list. + +When you are satisfied and want to push it to Apache's remote repo proceed with **Merging a PR** + +## How to create a PR (contributors) + +Create pull requests: \[[1]\]. + +Pull requests are made to apache/mahout repository on Github. In the Github UI you should pick the master +branch to target the PR as described for committers. This will be reviewed and commented on so the merge is +not automatic. This can be used for discussing a contributions in progress. + +## Merging a PR (yours or contributors) + +Start with reading \[[2]\] (merging locally). + +Remember that pull requests are equivalent to a remote github branch with potentially a multitude of commits. +In this case it is recommended to squash remote commit history to have one commit per issue, rather +than merging in a multitude of contributor's commits. In order to do that, as well as close the PR at the +same time, it is recommended to use **squash commits**. 
+ +Merging a pull request is equivalent to a "pull" of a contributor's branch: + + git checkout master # switch to local master branch + git pull apache master # fast-forward to current remote HEAD + git pull --squash https://github.com/cuser/mahout cbranch # merge to master + +`--squash` ensures all PR history is squashed into a single commit, and allows the committer to use their own +message. Read git help for merge or pull for more information about the `--squash` option. In this example we +assume that the contributor's Github handle is "cuser" and the PR branch name is "cbranch". +Next, resolve conflicts, if any, or ask the contributor to rebase on top of master, if the PR went out of sync. + +If you are ready to merge your own (committer's) PR you probably only need to merge (not pull), since you have a local copy +that you've been working on. This is the branch that you used to create the PR. + + git checkout master # switch to local master branch + git pull apache master # fast-forward to current remote HEAD + git merge --squash mahout-xxxx + +Remember to run regular patch checks, build with tests enabled, and change CHANGELOG. + +If everything is fine, you now can commit the squashed request along the lines + + git commit --author <contributor_email> -a -m "MAHOUT-XXXX description closes apache/mahout#ZZ" + +where MAHOUT-XXXX is all caps and `ZZ` is the pull request number on the apache/mahout repository. Including +"closes apache/mahout#ZZ" will close the PR automatically. More information is found here \[[3]\]. + +Next, push to git-wip-us.a.o: + + git push apache master + +(this will require Apache handle credentials). + +The PR, once pushed, will get mirrored to github. To update your github version push there too: + + git push origin master + +*Note on squashing: Since squash discards remote branch history, repeated PRs from the same remote branch are +difficult to merge. The workflow implies that every new PR starts with a new rebased branch. 
This is more +important for contributors to know, rather than for committers, because if a new PR is not mergeable, github +would warn to begin with. Anyway, watch for dupe PRs (based on same source branches). This is a bad practice.* + +## Closing a PR without committing (for committers) + +When we want to reject a PR (close without committing), we can just issue an empty commit on master's HEAD +*without merging the PR*: + + git commit --allow-empty -m "closes apache/mahout#ZZ *Won't fix*" + git push apache master + +that should close PR `ZZ` on the github mirror without merging and without any code modifications in the master repository. + +## Apache/github integration features + +Read \[[4]\]. Comments and PRs with Mahout issue handles should post to mailing lists and Jira. +Mahout issue handles must be in the form MAHOUT-YYYY (all capitals). Usually it makes sense to +file a jira issue first, and then create a PR with description + + MAHOUT-YYYY: <jira-issue-description> + + +In this case all subsequent comments will automatically be copied to jira without having to mention +the jira issue explicitly in each comment of the PR. + + +[1]: https://help.github.com/articles/creating-a-pull-request +[2]: https://help.github.com/articles/merging-a-pull-request#merging-locally +[3]: https://help.github.com/articles/closing-issues-via-commit-messages +[4]: https://blogs.apache.org/infra/entry/improved_integration_between_apache_and \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/githubPRs.md ---------------------------------------------------------------------- diff --git a/website/_pages/githubPRs.md b/website/_pages/githubPRs.md new file mode 100644 index 0000000..80164f3 --- /dev/null +++ b/website/_pages/githubPRs.md @@ -0,0 +1,92 @@ +Title: +Notice: Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. 
See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + . + http://www.apache.org/licenses/LICENSE-2.0 + . + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +# Handling Github PRs # + +---------- + + +## how to create a PR (for contributors) + +Read [[1]]. + +Pull requests are made to the apache/mahout repository on Github. + +## merging a PR and closing it (for committers). + +Remember that pull requests are equivalent to a remote branch with potentially a multitude of commits. +In this case it is recommended to squash remote commit history to have one commit per issue, rather +than merging in a multitude of contributor's commits. In order to do that, as well as close the PR at the +same time, it is recommended to use **squash commits**. + +Read [[2]] (merging locally). Merging a pull request is equivalent to merging the contributor's branch: + + git checkout master # switch to local master branch + git pull apache master # fast-forward to current remote HEAD + git pull --squash https://github.com/cuser/mahout cbranch # merge to master + + +In this example we assume that the contributor's Github handle is "cuser" and the PR branch name is "cbranch" there. We also +assume that the *apache* remote is configured as + + apache https://git-wip-us.apache.org/repos/asf/mahout.git (fetch) + apache https://git-wip-us.apache.org/repos/asf/mahout.git (push) + + +Squash pull ensures all PR history is squashed into a single commit. 
Also, the result is not yet committed, even if a +fast-forward is possible, so you get a chance to change things before committing. + +At this point, resolve conflicts, if any, or ask the contributor to rebase on top of master if the PR has gone out of sync. + +Also run the regular patch checks and update the CHANGELOG. + +If everything is fine, you can now commit the squashed request: + + git commit -a + +Edit the message to contain "MAHOUT-YYYY description **closes #ZZ**", where ZZ is the pull request number. +Including "closes #ZZ" will close the PR automatically. More information: [[3]]. + + git push apache master + +(this will require credentials). + +Note on squashing: Since a squash discards the remote branch history, repeated PRs from the same remote branch are +difficult to merge. The workflow implies that every new PR starts with a new rebased branch. This is more +important for contributors to know than for committers, because if a new PR is not mergeable, GitHub +warns about it up front. In any case, watch for duplicate PRs (based on the same source branch); they are bad practice. + +## Closing a PR without committing + +When we want to reject a PR (close it without committing), we make an empty commit on master's HEAD +*without merging the PR*: + + git commit --allow-empty -m "closes #ZZ *Won't fix*" + git push apache master + +That should close the PR without merging it and without any code modifications in the master repository. + +## Apache/github integration features + +Read [[4]]. Issue handles mentioned in comments and PR names should be posted to the mailing lists and Jira.
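
The squash-merge steps described earlier can be rehearsed without touching the real repositories. The following is a minimal sketch, assuming `git` is installed. A local bare repository stands in for the *apache* remote, and the `contributor` clone, the issue handle `MAHOUT-0000` and the PR number `#0` are placeholders invented for the illustration.

```shell
#!/bin/sh
# Rehearsal of the squash-merge workflow in throwaway local repositories.
# Everything here is a placeholder: "apache.git" stands in for the apache
# remote, "contributor" for the contributor's fork, and MAHOUT-0000 / #0
# for a real issue handle and PR number.
set -e

# Fixed identity so the commits work on a machine without git configured.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.invalid"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
cd "$work"

# Stand-in for the apache remote, with master as its default branch.
git init -q --bare apache.git
git -C apache.git symbolic-ref HEAD refs/heads/master

# Committer's clone: one initial commit on master, pushed to "apache".
git init -q committer
cd committer
git symbolic-ref HEAD refs/heads/master
echo "base" > README
git add README
git commit -q -m "initial commit"
git remote add apache ../apache.git
git push -q apache master
cd ..

# Contributor's fork with a PR branch holding two separate commits.
git clone -q apache.git contributor
cd contributor
git checkout -q -b cbranch
echo "first change" >> README
git commit -q -a -m "work in progress"
echo "second change" >> README
git commit -q -a -m "more work"
cd ..

# Committer squashes the PR branch into a single commit on master.
cd committer
git checkout -q master
git pull -q apache master
git pull -q --squash ../contributor cbranch   # stages the changes, does not commit
git commit -q -a -m "MAHOUT-0000 example change closes #0"
git push -q apache master

# master now holds the initial commit plus one squashed commit for the PR.
git log --oneline master
```

The contributor's two commits never appear on master; only the single squashed commit does, which is the "one commit per issue" history the workflow aims for.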
+ + +[1]: https://help.github.com/articles/creating-a-pull-request +[2]: https://help.github.com/articles/merging-a-pull-request#merging-locally +[3]: https://help.github.com/articles/closing-issues-via-commit-messages +[4]: https://blogs.apache.org/infra/entry/improved_integration_between_apache_and http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/glossary.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/glossary.mdtext b/website/_pages/glossary.mdtext new file mode 100644 index 0000000..95757fb --- /dev/null +++ b/website/_pages/glossary.mdtext @@ -0,0 +1,6 @@ +Title: Glossary +This is a list of common glossary terms used on both the mailing lists and +around the site. Where possible I have tried to provide a link to more +in-depth explanations from the web + +{children:excerpt=true|style=h4} http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/gsoc.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/gsoc.mdtext b/website/_pages/gsoc.mdtext new file mode 100644 index 0000000..246815e --- /dev/null +++ b/website/_pages/gsoc.mdtext @@ -0,0 +1,60 @@ +Title: GSOC + +# Google Summer of Code + +Mahout has been mentoring students in Google Summer of Code (GSoC) for as long as +the project has existed. To help students better understand what is +expected of them, this page lays out common advice, links and other tips +and tricks for successfully creating a GSoC proposal for Mahout. + +Be warned, however, that GSoC, particularly at the Apache Software +Foundation (ASF), is fairly competitive. Not only are you competing +against others within Mahout, but Mahout is competing with other projects +in the ASF. Therefore, it is very important that proposals be well +referenced and well thought out. Even if you don't get selected, consider +sticking around. 
Open source is fun, a great career builder and can open up many +opportunities for you. + +## Tips on Good Proposals + +* Interact with the community before proposal time. This is actually part +of how we rate proposals. Having a good idea is just one part of the +process. You must show you can communicate and work within the community +parameters. You might even consider putting up a patch or two that shows +you get how things work. See [How To Contribute](how-to-contribute.html). +* Since Machine Learning is fairly academic, be sure to cite your sources +in your proposal. +* Provide a realistic timeline. Be sure you indicate what other +obligations you have during the summer. It may seem worthwhile to lie +here, but we have failed students mid-term in the past because they did not +participate as they said they would. Failing mid-term means not getting +paid. +* Do not mail mentors off list privately unless it is something truly +personal (most things are not). This will likely decrease your chances of +being selected, not increase them. +* DO NOT BITE OFF MORE THAN YOU CAN CHEW. Every year, there are a few +students who propose to implement 3-5 machine learning algorithms on +Map/Reduce, all in a two month period. They NEVER get selected. Be +realistic. All successful projects to date follow, more or less, the +following formula: Implement algorithm on Map/Reduce. Write Unit Tests. +Do some bigger scale tests. Write 1 or 2 examples. Write Wiki +documentation. That's it. Trust us, it takes a summer to do these things. + + +## What to expect once selected + +* Just as in the proposals, almost all interaction should take place on the +mailing lists. Only personal matters related to your whereabouts or your +evaluation will take place privately. +* Show up. Ask questions. Be engaged. We don't care if you know it all +about what you are implementing. We care about you contributing to open +source. You learn. We learn. Win-win. +* Enjoy it! 
Contributing to open source can open some amazing doors for +your career. + +<a name="GSOC-References"></a> +## References + + * [GSoC Home](http://code.google.com/soc/) - official GSoC page + * [GSoC FAQ](http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs) - official FAQ + * [Apache GSoC coordination](http://community.apache.org/gsoc.html) - official Apache GSoC documentation, especially important if you want to become a mentor \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/how-to-become-a-committer.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/how-to-become-a-committer.mdtext b/website/_pages/how-to-become-a-committer.mdtext new file mode 100644 index 0000000..72a1b7d --- /dev/null +++ b/website/_pages/how-to-become-a-committer.mdtext @@ -0,0 +1,23 @@ +Title: How To Become A Committer + +# How to become a committer + +While there's no exact criteria for becoming a committer, there is a fairly +obvious path to becoming a committer. + +For starters, one should be familiar with the [Apache Way ](http://www.apache.org/foundation/how-it-works.html), especially the part about meritocracy. + +Second, participate in the mailing lists, help answer questions when you +can and do so in a respectful manner. This is often more important than +writing amazing code. + +Third, write code, add patches, stick with them and be patient. Add unit +tests and documentation. In general, tackling 3 or 4 decent patches is +where the bar is at, but it depends on the state of the project. In the +earlier stages of the project, the bar is a bit lower, so it pays to join +early! + +Finally, it is then up to someone to nominate them to the PMC. Typically, +one of the existing committers does this by sending an email to the private +PMC mailing list ([email protected], where m.a.o is mahout.apache.org) and then +the PMC votes on it. 
Nominations often occur internal to the PMC as well. http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/how-to-contribute.md ---------------------------------------------------------------------- diff --git a/website/_pages/how-to-contribute.md b/website/_pages/how-to-contribute.md new file mode 100644 index 0000000..cfedf8d --- /dev/null +++ b/website/_pages/how-to-contribute.md @@ -0,0 +1,153 @@ +--- +layout: mahout +title: How To Contribute +permalink: /How-To-Contribute/ +--- + +# How to contribute + +*Contributing to an Apache project* is about more than just writing code -- +it's about doing what you can to make the project better. There are lots +of ways to contribute! + +<a name="HowToContribute-BeInvolved"></a> +## Get Involved + +Discussions at Apache happen on the mailing list. To get involved, you should join the [Mahout mailing lists](/general/mailing-lists,-irc-and-archives.html). In particular: + +* The **user list** (to help others) +* The **development list** (to join discussions of changes) -- This is the best place +to understand where the project is headed. +* The **commit list** (to see changes as they are made) + +Please keep discussions about Mahout on list so that everyone benefits. +Emailing individual committers with questions about specific Mahout issues +is discouraged. See [http://people.apache.org/~hossman/#private_q](http://people.apache.org/~hossman/#private_q) +. Apache has a number of [email tips for contributors][1] as well. + +<a name="HowToContribute-WhattoWorkOn?"></a> +## What to Work On? + +What do you like to work on? There are a ton of things in Mahout that we +would love to have contributions for: documentation, performance improvements, better tests, etc. +The best place to start is by looking into our [issue tracker](https://issues.apache.org/jira/browse/MAHOUT) and +seeing what bugs have been reported and seeing if any look like you could +take them on. 
Small, well written, well tested patches are a great way to +get your feet wet. It could be something as simple as fixing a typo. The +more important piece is you are showing you understand the necessary steps +for making changes to the code. Mahout is a pretty big beast at this +point, so changes, especially from non-committers, need to be evolutionary +not revolutionary since it is often very difficult to evaluate the merits +of a very large patch. Think small, at least to start! + +Beyond JIRA, hang out on the dev@ mailing list. That's where we discuss +what we are working on in the internals and where you can get a sense of +where people are working. + +Also, documentation is a great way to familiarize yourself with the code +and is always a welcome addition to the codebase and this website. Feel free +to contribute texts and tutorials! Committers will make sure they are added +to this website, and we have a [guide for making website updates][2]. +We also have a [wide variety of books and slides][3] for learning more about +machine learning algorithms. + +If you are interested in working towards being a committer, [general guidelines are available online](/developers/how-to-become-a-committer.html). + +<a name="HowToContribute-ContributingCode(Features,BigFixes,Tests,etc...)"></a> +## Contributing Code (Features, Big Fixes, Tests, etc...) + +This section identifies the ''optimal'' steps community member can take to +submit a changes or additions to the Mahout code base. This can be new +features, bug fixes optimizations of existing features, or tests of +existing code to prove it works as advertised (and to make it more robust +against possible future changes). 
+ +Please note that these are the "optimal" steps, and community members that +don't have the time or resources to do everything outlined on this below +should not be discouraged from submitting their ideas "as is" per "Yonik +Seeley's (Solr committer) Law of Patches": + +*A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all.* + +Just because you may not have the time to write unit tests, or cleanup +backwards compatibility issues, or add documentation, doesn't mean other +people don't. Putting your patch out there allows other people to try it +and possibly improve it. + +<a name="HowToContribute-Gettingthesourcecode"></a> +## Getting the source code + +First of all, you need to get the [Mahout source code](/developers/version-control.html). Most development is done on the "trunk". Mahout mirrors its codebase on [GitHub](https://github.com/apache/mahout). The first step to making a contribution is to fork Mahout's master branch to your GitHub repository. + + +<a name="HowToContribute-MakingChanges"></a> +## Making Changes + +Before you start, you should send a message to the [Mahout developer mailing list](/general/mailing-lists,-irc-and-archives.html) +(note: you have to subscribe before you can post), or file a ticket in our [issue tracker](/developers/issue-tracker.html). +Describe your proposed changes and check that they fit in with what others are doing and have planned for the project. Be patient, it may take folks a while to understand your requirements. + + 1. Create a JIRA Issue (if one does not already exist or you haven't already) + 2. Pull the code from your GitHub repository + 3. Ensure that you are working with the latest code from the [apache/mahout](https://github.com/apache/mahout) master branch. + 3. Modify the source code and add some (very) nice features. 
+ - Be sure to adhere to the following points: + - All public classes and methods should have informative Javadoc + comments. + - Code should be formatted according to standard + [Java coding conventions](http://www.oracle.com/technetwork/java/codeconventions-150003.pdf), + with two exceptions: + - indent two spaces per level, not four. + - lines can be 120 characters, not 80. + - Contributions should pass existing unit tests. + - New unit tests should be provided to demonstrate bugs and fixes. + 4. Commit the changes to your local repository. + 4. Push the code back up to your GitHub repository. + 5. Create a [Pull Request](https://help.github.com/articles/creating-a-pull-request) to the apache/mahout repository on GitHub. + - Include the corresponding JIRA Issue number and description in the title of the pull request: + - e.g. MAHOUT-xxxx: < JIRA-Issue-Description > + 6. Committers and other members of the Mahout community can then comment on the Pull Request. Be sure to watch for comments, respond and make any necessary changes. + +Please be patient. Committers are busy people too. If no one responds to your Pull Request after a few days, please send a friendly reminder to the mailing list. Please +incorporate others' suggestions into your changes if you think they're reasonable. Finally, remember that even changes that are not committed are useful to the community. + +<a name="HowToContribute-UnitTests"></a> +#### Unit Tests + +Please make sure that all unit tests succeed before creating your Pull Request. + +Run *mvn clean test*. If you see *BUILD SUCCESS* after the tests have finished, all is ok, but if you see *BUILD FAILURE*, +please carefully read the error messages and check your code. + +#### Do's and Don'ts + +Please do not: + +* reformat code unrelated to the bug being fixed: formatting changes should +be done in separate issues. +* comment out code that is now obsolete: just remove it.
+* insert comments around each change, marking the change: folks can use +subversion to figure out what's changed and by whom. +* make things public which are not required by end users. + +Please do: + +* try to adhere to the coding style of files you edit; +* comment code whose function or rationale is not obvious; +* update documentation (e.g., ''package.html'' files, the website, etc.) + + +<a name="HowToContribute-Review/ImproveExistingPatches"></a> +## Review/Improve Existing Pull Requests + +If there's a JIRA issue that already has a Pull Request with changes that you think are really good, and works well for you -- please add a comment saying so. If there's room +for improvement (more tests, better javadocs, etc...) then make the changes on your GitHub branch and add a comment about them. If a lot of people review a Pull Request and give it a +thumbs up, that's a good sign for committers when deciding if it's worth spending time to review it -- and if other people have already put in +effort to improve the docs/tests for an issue, that helps even more. + +For more information see [Handling GitHub PRs](http://mahout.apache.org/developers/github.html). + + + [1]: http://www.apache.org/dev/contrib-email-tips + [2]: http://mahout.apache.org/developers/how-to-update-the-website.html + [3]: http://mahout.apache.org/general/books-tutorials-and-talks.html \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/how-to-contribute.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/how-to-contribute.mdtext b/website/_pages/how-to-contribute.mdtext new file mode 100644 index 0000000..37a6cbf --- /dev/null +++ b/website/_pages/how-to-contribute.mdtext @@ -0,0 +1,149 @@ +Title: How To Contribute + +# How to contribute + +*Contributing to an Apache project* is about more than just writing code -- +it's about doing what you can to make the project better. 
There are lots +of ways to contribute! + +<a name="HowToContribute-BeInvolved"></a> +## Get Involved + +Discussions at Apache happen on the mailing list. To get involved, you should join the [Mahout mailing lists](/general/mailing-lists,-irc-and-archives.html). In particular: + +* The **user list** (to help others) +* The **development list** (to join discussions of changes) -- This is the best place +to understand where the project is headed. +* The **commit list** (to see changes as they are made) + +Please keep discussions about Mahout on list so that everyone benefits. +Emailing individual committers with questions about specific Mahout issues +is discouraged. See [http://people.apache.org/~hossman/#private_q](http://people.apache.org/~hossman/#private_q) +. Apache has a number of [email tips for contributors][1] as well. + +<a name="HowToContribute-WhattoWorkOn?"></a> +## What to Work On? + +What do you like to work on? There are a ton of things in Mahout that we +would love to have contributions for: documentation, performance improvements, better tests, etc. +The best place to start is by looking into our [issue tracker](https://issues.apache.org/jira/browse/MAHOUT) and +seeing what bugs have been reported and seeing if any look like you could +take them on. Small, well written, well tested patches are a great way to +get your feet wet. It could be something as simple as fixing a typo. The +more important piece is you are showing you understand the necessary steps +for making changes to the code. Mahout is a pretty big beast at this +point, so changes, especially from non-committers, need to be evolutionary +not revolutionary since it is often very difficult to evaluate the merits +of a very large patch. Think small, at least to start! + +Beyond JIRA, hang out on the dev@ mailing list. That's where we discuss +what we are working on in the internals and where you can get a sense of +where people are working. 
+ +Also, documentation is a great way to familiarize yourself with the code +and is always a welcome addition to the codebase and this website. Feel free +to contribute texts and tutorials! Committers will make sure they are added +to this website, and we have a [guide for making website updates][2]. +We also have a [wide variety of books and slides][3] for learning more about +machine learning algorithms. + +If you are interested in working towards being a committer, [general guidelines are available online](/developers/how-to-become-a-committer.html). + +<a name="HowToContribute-ContributingCode(Features,BigFixes,Tests,etc...)"></a> +## Contributing Code (Features, Big Fixes, Tests, etc...) + +This section identifies the ''optimal'' steps community member can take to +submit a changes or additions to the Mahout code base. This can be new +features, bug fixes optimizations of existing features, or tests of +existing code to prove it works as advertised (and to make it more robust +against possible future changes). + +Please note that these are the "optimal" steps, and community members that +don't have the time or resources to do everything outlined on this below +should not be discouraged from submitting their ideas "as is" per "Yonik +Seeley's (Solr committer) Law of Patches": + +*A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all.* + +Just because you may not have the time to write unit tests, or cleanup +backwards compatibility issues, or add documentation, doesn't mean other +people don't. Putting your patch out there allows other people to try it +and possibly improve it. + +<a name="HowToContribute-Gettingthesourcecode"></a> +## Getting the source code + +First of all, you need to get the [Mahout source code](/developers/version-control.html). Most development is done on the "trunk". Mahout mirrors its codebase on [GitHub](https://github.com/apache/mahout). 
The first step to making a contribution is to fork Mahout's master branch to your GitHub repository. + + +<a name="HowToContribute-MakingChanges"></a> +## Making Changes + +Before you start, you should send a message to the [Mahout developer mailing list](/general/mailing-lists,-irc-and-archives.html) +(note: you have to subscribe before you can post), or file a ticket in our [issue tracker](/developers/issue-tracker.html). +Describe your proposed changes and check that they fit in with what others are doing and have planned for the project. Be patient, it may take folks a while to understand your requirements. + + 1. Create a JIRA Issue (if one does not already exist or you haven't already) + 2. Pull the code from your GitHub repository + 3. Ensure that you are working with the latest code from the [apache/mahout](https://github.com/apache/mahout) master branch. + 3. Modify the source code and add some (very) nice features. + - Be sure to adhere to the following points: + - All public classes and methods should have informative Javadoc + comments. + - Code should be formatted according to standard + [Java coding conventions](http://www.oracle.com/technetwork/java/codeconventions-150003.pdf), + with two exceptions: + - indent two spaces per level, not four. + - lines can be 120 characters, not 80. + - Contributions should pass existing unit tests. + - New unit tests should be provided to demonstrate bugs and fixes. + 4. Commit the changes to your local repository. + 4. Push the code back up to your GitHub repository. + 5. Create a [Pull Request](https://help.github.com/articles/creating-a-pull-request) to the to apache/mahout repository on Github. + - Include the corresponding JIRA Issue number and description in the title of the pull request: + - ie. MAHOUT-xxxx: < JIRA-Issue-Description > + 6. Committers and other members of the Mahout community can then comment on the Pull Request. Be sure to watch for comments, respond and make any necessary changes. 
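
The numbered steps above map onto a handful of git commands. The sketch below rehearses them locally, assuming `git` is installed. `upstream.git` and `fork.git` are local stand-ins for apache/mahout and your GitHub fork, and the branch name `MAHOUT-0000`, the file `NOTES.txt` and the commit message are hypothetical. Creating the JIRA issue and opening the Pull Request happen in the web interfaces and are not shown.

```shell
#!/bin/sh
# Local rehearsal of the contributor workflow. "upstream.git" and "fork.git"
# are stand-ins for apache/mahout and your GitHub fork; MAHOUT-0000 and
# NOTES.txt are placeholder names.
set -e

# Fixed identity so the commits work on a machine without git configured.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.invalid"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

sandbox=$(mktemp -d)
cd "$sandbox"

# Stand-in for apache/mahout with one commit on master.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
echo "mahout" > NOTES.txt
git add NOTES.txt
git commit -q -m "initial commit"
cd ..
git clone -q --bare seed upstream.git

# Stand-in for your GitHub fork of apache/mahout.
git clone -q --bare upstream.git fork.git

# Pull the code from your fork.
git clone -q fork.git mahout
cd mahout

# Make sure the work starts from the latest upstream master.
git remote add upstream ../upstream.git
git fetch -q upstream
git checkout -q --no-track -b MAHOUT-0000 upstream/master

# Modify the source, then commit to the local repository.
echo "a small, well tested change" >> NOTES.txt
git commit -q -a -m "MAHOUT-0000: describe the change"

# Push the branch to your fork; the Pull Request is opened from this branch.
git push -q origin MAHOUT-0000

git log --oneline MAHOUT-0000
```

Working on a dedicated branch per issue, rather than on master, keeps each Pull Request limited to one change and leaves your fork's master free to follow upstream.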
+ +Please be patient. Committers are busy people too. If no one responds to your Pull Request after a few days, please send a friendly reminder to the mailing list. Please +incorporate others' suggestions into your changes if you think they're reasonable. Finally, remember that even changes that are not committed are useful to the community. + +<a name="HowToContribute-UnitTests"></a> +#### Unit Tests + +Please make sure that all unit tests succeed before creating your Pull Request. + +Run *mvn clean test*. If you see *BUILD SUCCESS* after the tests have finished, all is ok, but if you see *BUILD FAILURE*, +please carefully read the error messages and check your code. + +#### Do's and Don'ts + +Please do not: + +* reformat code unrelated to the bug being fixed: formatting changes should +be done in separate issues. +* comment out code that is now obsolete: just remove it. +* insert comments around each change, marking the change: folks can use +the version control history to figure out what's changed and by whom. +* make things public which are not required by end users. + +Please do: + +* try to adhere to the coding style of files you edit; +* comment code whose function or rationale is not obvious; +* update documentation (e.g., ''package.html'' files, the website, etc.) + + +<a name="HowToContribute-Review/ImproveExistingPatches"></a> +## Review/Improve Existing Pull Requests + +If there's a JIRA issue that already has a Pull Request with changes that you think are really good and that work well for you -- please add a comment saying so. If there's room +for improvement (more tests, better javadocs, etc...) then make the changes on your GitHub branch and add a comment about them. If a lot of people review a Pull Request and give it a +thumbs up, that's a good sign for committers when deciding if it's worth spending time to review it -- and if other people have already put in +effort to improve the docs/tests for an issue, that helps even more.
+ +For more information see [Handling GitHub PRs](http://mahout.apache.org/developers/github.html). + + + [1]: http://www.apache.org/dev/contrib-email-tips + [2]: http://mahout.apache.org/developers/how-to-update-the-website.html + [3]: http://mahout.apache.org/general/books-tutorials-and-talks.html \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/how-to-release.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/how-to-release.mdtext b/website/_pages/how-to-release.mdtext new file mode 100644 index 0000000..7d84b8f --- /dev/null +++ b/website/_pages/how-to-release.mdtext @@ -0,0 +1,232 @@ +Title: How To Release + +# How To Release Mahout + + +*This page is prepared for Mahout committers. You need committer rights to +create a new Mahout release.* + +<a name="HowToRelease-ReleasePlanning"></a> +# Release Planning + +Start a discussion on mahout-dev about having a release, questions to bring +up include: + + * Any [Unresolved JIRA issues for the upcoming release ](-https://issues.apache.org/jira/secure/issuenavigator!executeadvanced.jspa?jqlquery=project+%3d+mahout+and+resolution+%3d+unresolved+and+fixversion+%3d+%220.6%22&runquery=true&clear=true.html) + * Any [Resolved or Closed JIRA issues missing a "Fix Version" ](-https://issues.apache.org/jira/secure/issuenavigator!executeadvanced.jspa?jqlquery=project+%3d+mahout+and+%28status+%3d+resolved+or+status+%3d+closed%29+and+fixversion+is+null+and+resolution+%3d+fixed&runquery=true&clear=true.html) + that should be marked as fixed in this release? + * Does any documentation need an update? + * Who is going to be the "release engineer"? + * What day should be targeted for the release ? Leave buffer time for a +code freeze and release candidate testing; make sure at least a few people +commit to having time to help test the release candidates around the target +date. 
+ + +<a name="HowToRelease-CodeFreeze"></a> +# Code Freeze + +For 7-14 days prior to the release target date, have a "code freeze" where +committers agree to only commit things if they: + + * Are documentation improvements (including fixes to eliminate Javadoc +warnings) + * Are new test cases that improve test coverage + * Are bug fixes found because of improved test coverage + * Are new tests and bug fixes for new bugs encountered by manually testing + +<a name="HowToRelease-StepsForReleaseEngineer"></a> +# Steps For Release Engineer + +<a name="HowToRelease-Beforebuildingrelease"></a> +## Before building release +1. Check that all tests pass after a clean compile: mvn clean test +1. Check that there are no remaining unresolved Jira issues with the +upcoming version number listed as the "Fix" version +1. Publish any prev. unpublished Third Party Deps: [Thirdparty Dependencies](thirdparty-dependencies.html) + +<a name="HowToRelease-PreviewingtheArtifacts"></a> +## Previewing the Artifacts +1. To build the artifacts: +1. # mvn -Pmahout-release,apache-release,hadoop2 package + +<a name="HowToRelease-Makingarelease"></a> +## Making a release +* Check if documentation needs an update +* Update the web site's news by updating a working copy of the SVN +directory at https://svn.apache.org/repos/asf/mahout/site/new_website +* Commit these changes. It is important to do this prior to the build so +that it is reflected in the copy of the website included with the release +for documentation purposes. +* If this is your first release, add your key to the KEYS file. The KEYS +file is located on Github at +https://github.com/apache/mahout/master/distribution/KEYS and copy it +to the release directory. +Make sure you commit your change. +See http://www.apache.org/dev/release-signing.html. +* Ensure you have set up standard Apache committer settings in + ~/.m2/settings.xml as per [this page](http://maven.apache.org/developers/committer-settings.html) +. 
+* Add a profile to your ~/.m2/settings.xml in the <profiles> section with: + + <blockquote> + <profiles> + <profile> + <id>mahout_release</id> + <properties> + <gpg.keyname>YOUR PGP KEY NAME</gpg.keyname> + <gpg.passphrase>YOUR SIGNING PASSCODE HERE</gpg.passphrase> + +<deploy.altRepository>mahout.releases::default::https://repository.apache.org/service/local/staging/deploy/maven2/</deploy.altRepository> + <username>USERNAME</username> + +<deploy.url>https://repository.apache.org/service/local/staging/deploy/maven2/</deploy.url> + </properties> + </profile> + </profiles> +</blockquote> + +* You may also need to add the following to the <servers> section in +~/.m2/settings.xml in order to upload artifacts (as the -Dusername= +-Dpassword= didn't work for Grant for 0.8, but this did): +<blockquote> +<server> + <id>apache.releases.https</id> + <username>USERNAME</username> + <password>PASSWORD</password> +</server> +</blockquote> + +* Set environment variable MAVEN_OPTS to -Xmx1024m to ensure the tests can +run +* export _JAVA_OPTIONS="-Xmx1g" +* If you are outside the US, then svn.apache.org may not resolve to the +main US-based Subversion servers. (Compare the IP address you get for +svn.apache.org with svn.us.apache.org to see if they are different.) This +will cause problems during the release since it will create a revision and +then immediately access it, but there is a replication lag of perhaps a +minute to the non-US servers. To temporarily force using the US-based +server, edit your equivalent of /etc/hosts and map the IP address of +svn.us.apache.org to svn.apache.org. +* Create the release candidate: + + mvn -Pmahout-release,apache-release,hadoop2 release:prepare release:perform + + If you have problems authenticating to svn.apache.org, try adding to the command line + + -Dusername=\[user\] -Dpassword=\[password\] + + If it fails, first try doing: + + mvn -Pmahout-release,apache-release,hadoop2 release:rollback
+ + followed by + + mvn -Pmahout-release,apache-release,hadoop2 release:clean + + This will likely save you time and do the right thing. You may also need to delete the tag in source control: + + git tag -d mahout-X.XX.X; git push apache :refs/tags/mahout-X.XX.X + + You may also have to roll back the version numbers in the POM files. + + If you want to skip test cases while rebuilding, use + + mvn -DpreparationGoals="clean compile" release:prepare release:perform + +* Review the artifacts, etc. on the Apache Repository (using Sonatype's +Nexus application) site: https://repository.apache.org/. + You will need to log in using your ASF SVN credentials and then +browse to the staging area. +* Once you have reviewed the artifacts, you will need to "Close" out +the staging area under Nexus, which then makes the artifacts available for +others to see. + * Log in to Nexus + * Click the Staging Repositories link in the left hand menu + * Click the Mahout staged one that was just uploaded by the +release:perform target + * Click Close in the toolbar. See +https://docs.sonatype.org/display/Repository/Closing+a+Staging+Repository +for a picture + * Copy the "Repository URL" link to your email; it should be like +https://repository.apache.org/content/repositories/orgapachemahout-024/ +* Call a VOTE on [email protected]. Votes require 3 days before +passing. See the Apache [release policy](http://www.apache.org/foundation/voting.html#ReleaseVotes) + for more info. +* If there's a problem, you need to unwind the release and start all +over. + <blockquote> + mvn -Pmahout-release,apache-release,hadoop2 versions:set -DnewVersion=PREVIOUS_SNAPSHOT + + mvn -Pmahout-release,apache-release,hadoop2 versions:commit + + git commit + + git push --delete apache <tagname> (deletes the remote tag) + git tag -d <tagname> (deletes the local tag) + </blockquote> + +* Release the artifact in the Nexus Repository in the same way you +Closed it earlier.
+* Add your key to the KEYS file at +http://www.apache.org/dist/mahout/<version>/ +* Copy the assemblies and their supporting files (tar.gz, zip, tar.bz2, +plus .asc, .md5, .pom, .sha1 files) to the ASF mirrors at: +people.apache.org:/www/www.apache.org/dist/mahout/<version>/. You should +make sure the group "mahout" owns the files and that they are read only +(-r--r--r-- in UNIX-speak). See [Guide To Distributing Existing Releases Through The ASF Mirrors|http://jakarta.apache.org/site/convert-to-mirror.html?Step-By-Step] + and the links that are there. + * cd /www/www.apache.org/dist/mahout + * mkdir <VERSION> + * cd <VERSION> + * wget -e robots=off --no-check-certificate -np -r +https://repository.apache.org/content/groups/public/org/apache/mahout/apache-mahout-distribution/<VERSION>/ + * mv +repository.apache.org/content/groups/public/org/apache/mahout/mahout-distribution/0.8/* +. + * rm -rf repository.apache.org/ + * rm index.html +* Wait 24 hours for release to propagate to mirrors. +* Clean up JIRA: Bulk close all X.Y JIRA issues. Mark the Version +number as being released (see Manage Versions.) Add the next version +(X.Y+1) if necessary. +* Update release version on http://mahout.apache.org/ and +http://en.wikipedia.org/wiki/Apache_Mahout +* +https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Update+The+Website +* Send announcements to the user and developer lists. 
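The mirror-copy step above names the archive types and the companion files that travel with them, and asks for read-only permissions. As a rough pre-upload check, the Python sketch below reports archives whose signature or checksum files are missing and then marks every file read-only. The helper names (`missing_companions`, `make_read_only`) and the `<archive><suffix>` naming of companion files are assumptions made for illustration and are not part of Mahout's tooling; group ownership and the `.pom` files are not handled.

```python
import stat
from pathlib import Path

# Archive types and per-archive companion suffixes named in the step above.
ARCHIVE_SUFFIXES = (".tar.gz", ".zip", ".tar.bz2")
COMPANION_SUFFIXES = (".asc", ".md5", ".sha1")


def missing_companions(dist_dir):
    """Return names of companion files that are absent for each archive.

    Assumes (hypothetically) that companions are named <archive><suffix>,
    for example foo.tar.gz.asc next to foo.tar.gz.
    """
    dist = Path(dist_dir)
    missing = []
    for archive in sorted(dist.iterdir()):
        if archive.name.endswith(ARCHIVE_SUFFIXES):
            for suffix in COMPANION_SUFFIXES:
                companion = dist / (archive.name + suffix)
                if not companion.exists():
                    missing.append(companion.name)
    return missing


def make_read_only(dist_dir):
    """Set every regular file in dist_dir to mode 0444 (-r--r--r--)."""
    for path in Path(dist_dir).iterdir():
        if path.is_file():
            path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```

Running `missing_companions` on the version directory before `make_read_only` keeps the check cheap to repeat, since a missing file can still be added while the directory contents are writable.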
+ + + +See also: + +* http://maven.apache.org/developers/release/releasing.html +* +http://www.sonatype.com/books/nexus-book/reference/staging-sect-deployment.html +* http://www.sonatype.com/books/nexus-book/reference/index.html + + +### Post Release +## Versioning +* Create the next version in JIRA (if it doesn't already exist) +* Mark the version as "released" in JIRA (noting the release date) + +## Documentation +* Change wiki to match current best practices (remove/change deprecations, +etc) + +## Publicity +* update freshmeat +* blog away +* Update MLOSS entry: http://mloss.org/revision/view/387/. See Grant for +details. + +## Related Resources + +* http://www.apache.org/dev/#releases +* http://www.apache.org/dev/#mirror + +# TODO: Things To Cleanup in this document + +* more specifics about things to test before starting or after packaging +(RAT, run scripts against example, etc...) +* include info about [Voting | http://www.apache.org/foundation/voting.html#ReleaseVotes] \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/how-to-update-the-website.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/how-to-update-the-website.mdtext b/website/_pages/how-to-update-the-website.mdtext new file mode 100644 index 0000000..2cd09bb --- /dev/null +++ b/website/_pages/how-to-update-the-website.mdtext @@ -0,0 +1,17 @@ +Title: How To Update The Website + +# How to update the Mahout Website + +<a name="HowToUpdateTheWebsite-Howtoupdatethemahouthomepage"></a> +## How to update the mahout home page +1. If you are not a committer of Apache Mahout, please open a ticket in our [issue tracker](/developers/issue-tracker.html) and attach a text describing the changes/additions you want to contribute +1. If you are a committer: + 1. Make sure you have the [Apache CMS bookmarklet](https://cms.apache.org/#bookmark) installed. + 1. 
For all else refer to the [Apache CMS reference docs](http://www.apache.org/dev/cmsref.html). + +<a name="HowToUpdateTheWebsite-SomeDo'sandDont'sofupdatingthewiki"></a> +## Some Dos and Don'ts of updating the web site +1. Keep all pages cleanly formatted - this includes using standard formatting for headers etc. +1. Try to keep a single page for a topic instead of starting multiple ones. +If the topics are related, put the new page as a child under the related page. +1. Notify the developers of orphaned or broken links. http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/issue-tracker.md ---------------------------------------------------------------------- diff --git a/website/_pages/issue-tracker.md b/website/_pages/issue-tracker.md new file mode 100644 index 0000000..cd9b56b --- /dev/null +++ b/website/_pages/issue-tracker.md @@ -0,0 +1,44 @@ +--- +layout: mahout +title: Issue Tracker +permalink: /issue-tracker/ +--- +# Issue tracker + + +Mahout's issue tracker is located [here](http://issues.apache.org/jira/browse/MAHOUT). +For most changes (apart from trivial stuff) Mahout works according to a review-then-commit model. +This means anything that is to be added is first presented as a patch in the issue tracker. All conversations in the issue tracker are automatically +echoed on the developer mailing list and people tend to respond or continue +conversations there rather than in the issue tracker, so in order to follow an +issue you also have to read the <a href="http://mahout.apache.org/general/mailing-lists,-irc-and-archives.html">mailing lists</a>. + +An issue does not literally have to be an issue. It could be a wish, task, +bug report, etc. and it does not have to contain a patch. + +Mahout uses [JIRA](https://confluence.atlassian.com/display/JIRA/JIRA+Documentation) by Atlassian. + +<a name="IssueTracker-Bestpractise"></a> +#### Best practices + +Don't create duplicate issues.
Make sure your problem is a problem and that +nobody else already fixed it. If you are new to the project, it is often +preferred that the subject of an issue is discussed on one of our mailing +lists before an issue is created - in particular when it comes to adding new functionality. + +Quote only what it is you are responding to in comments. + +Patches should be created at trunk or trunk parent level and if possible be +a single uncompressed text file so it is easy to inspect the patch in a web +browser. (See [Patch Check List](/developers/patch-check-list.html) +.) + +Use the issue identity when referring to an issue in any discussion. +"MAHOUT-n" and not "mahout-n" or "n". MAHOUT-1 would automatically be +linked to [MAHOUT-1](http://issues.apache.org/jira/browse/MAHOUT-1) + in a better world. + +A note to committers: Make sure to mention the issue id in each commit. Not only has +JIRA the capability of auto-linking commits to the issue they are related to +that way, it also makes it easier to get further information for a specific commit +when browsing through the commit log and within the commit mailing list. http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/mahout-benchmarks.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/mahout-benchmarks.mdtext b/website/_pages/mahout-benchmarks.mdtext new file mode 100644 index 0000000..60f973e --- /dev/null +++ b/website/_pages/mahout-benchmarks.mdtext @@ -0,0 +1,148 @@ +Title: Mahout Benchmarks + +<a name="MahoutBenchmarks-Introduction"></a> +# Introduction + +Depending on hardware configuration, exact distribution of ratings over users and items YMMV! + +<a name="MahoutBenchmarks-Recommenders"></a> +# Recommenders + +<a name="MahoutBenchmarks-ARuleofThumb"></a> +## A Rule of Thumb + +100M preferences are about the data set size where non-distributed +recommenders will outgrow a normal-sized machine (32-bit, <= 4GB RAM). 
Your +mileage will vary significantly with the nature of the data. + +<a name="MahoutBenchmarks-Distributedrecommendervs.Wikipedialinks(May272010)"></a> +## Distributed recommender vs. Wikipedia links (May 27 2010) + +From the mailing list: + +I just finished running a set of recommendations based on the Wikipedia +link graph, for book purposes (yeah, it's unconventional). I ran on my +laptop, but it ought to be crudely representative of how it runs in a real +cluster. + +The input is 1058MB as a text file, and contains 130M article-article +associations, from 5.7M articles to 3.8M distinct articles ("users" and +"items", respectively). I estimate cost based on Amazon's North +American small Linux-based instance pricing of $0.085/hour. I ran on a +dual-core laptop with plenty of RAM, allowing 1GB per worker, so this is +valid. + +In this run, I run recommendations for all 5.7M "users". You can certainly +run for any subset of all users of course. + +Phase 1 (Item ID to item index mapping) +29 minutes CPU time +$0.05 +60MB output + +Phase 2 (Create user vectors) +88 minutes CPU time +$0.13 +Output: 1159MB + +Phase 3 (Count co-occurrence) +77 hours CPU time +$6.54 +Output: 23.6GB + +Phase 4 (Partial multiply prep) +10.5 hours CPU time +$0.90 +Output: 24.6GB + +Phase 5 (Aggregate and recommend) +about 600 hours +about $51.00 +about 10GB +(I estimated these rather than let it run at home for days!) + + +Note that phases 1 and 3 may be run less frequently, and need not be run +every time. But the cost is dominated by the last step, which is most of +the work. I've ignored storage costs. + +This implies a cost of $0.01 (or about 8 instance-minutes) per 1,000 user +recommendations. That's not bad if, say, you want to update recs for your +site's 100,000 daily active users for a dollar. + +There are several levers one could pull internally to sacrifice accuracy +for speed, but it's currently set to pretty normal values. So this is just +one possibility.
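The per-phase costs and the per-1,000-user figure above follow from multiplying CPU hours by the quoted $0.085/hour price. The Python sketch below redoes that arithmetic as a sanity check; the function and variable names are illustrative, and it assumes a flat hourly price with no rounding up to whole instance-hours, which may be why individual phases can differ from the quoted amounts by about a cent.

```python
# Amazon small-instance price quoted in the benchmark, in USD per instance-hour.
PRICE_PER_HOUR = 0.085

# CPU hours per phase as reported above (phase 5 is the author's own estimate).
PHASE_HOURS = {
    "1: item ID to item index mapping": 29 / 60,
    "2: create user vectors": 88 / 60,
    "3: count co-occurrence": 77.0,
    "4: partial multiply prep": 10.5,
    "5: aggregate and recommend": 600.0,
}

USERS = 5_700_000  # all "users" (source articles) in the run


def phase_costs(hours_by_phase, price_per_hour=PRICE_PER_HOUR):
    """Return the dollar cost of each phase at a flat hourly price."""
    return {name: hours * price_per_hour for name, hours in hours_by_phase.items()}


def per_thousand_users(hours_by_phase, users=USERS, price_per_hour=PRICE_PER_HOUR):
    """Return (dollars, instance-minutes) spent per 1,000 users."""
    total_hours = sum(hours_by_phase.values())
    thousands = users / 1000
    return total_hours * price_per_hour / thousands, total_hours * 60 / thousands


if __name__ == "__main__":
    for name, cost in phase_costs(PHASE_HOURS).items():
        print(f"phase {name}: ${cost:.2f}")
    dollars, minutes = per_thousand_users(PHASE_HOURS)
    print(f"per 1,000 users: ${dollars:.3f}, {minutes:.1f} instance-minutes")
```

Under these assumptions the total comes to roughly 690 instance-hours, which works out to about one cent and a little over seven instance-minutes per 1,000 users, in line with the rounded figures in the message.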
+ +Now that's not terrible, but it is about 8x more computing than would be +needed by a non-distributed implementation *if* you could fit the whole +data set into a very large instance's memory, which is still possible at +this scale but needs a pretty big instance. That's a very apples-to-oranges +comparison of course; different algorithms, entirely different +environments. This is about the amount of overhead I'd expect from +distributing -- interesting to note how non-trivial it is. + +<a name="MahoutBenchmarks-Non-distributedrecommendervs.KDDCupdataset(March2011)"></a> +## Non-distributed recommender vs. KDD Cup data set (March 2011) + +(From the [email protected] mailing list) + +I've been test-driving a simple application of Mahout recommenders (the +non-distributed kind) on Amazon EC2 on the new Yahoo KDD Cup data set +(kddcup.yahoo.com). + +In the spirit of open-source, like I mentioned, I'm committing the extra +code to mahout-examples that can be used to run a Recommender on the input +and output the right format. And, I'd like to publish the rough timings +too. Find all the source in org.apache.mahout.cf.taste.example.kddcup + +<a name="MahoutBenchmarks-Track1"></a> +### Track 1 + +* m2.2xlarge instance, 34.2GB RAM / 4 cores +* Steady state memory consumption: ~19GB +* Computation time: 30 hours (wall clock-time) +* CPU time per user: ~0.43 sec +* Cost on EC2: $34.20 (!) + +(Helpful hint on cost I realized after the fact: you can almost surely get +spot instances for cheaper. The maximum price this sort of instance has +gone for as a spot instance is about $0.60/hour, vs "retail price" of +$1.14/hour.) + +Resulted in an RMSE of 29.5618 (the rating scale is 0-100), which is only +good enough for 29th place at the moment. Not terrible for "out of the box" +performance -- it's just using an item-based recommender with uncentered +cosine similarity. But not really good in absolute terms. 
A winning +solution is going to try to factor in time, and apply more sophisticated +techniques. The best RMSE so far is about 23. + +<a name="MahoutBenchmarks-Track2"></a> +### Track 2 + +* c1.xlarge instance: 7GB RAM / 8 cores +* Steady state memory consumption: ~3.8GB +* Computation time: 4.1 hours (wall-clock time) +* CPU time per user: ~1.1 sec +* Cost on EC2: $3.20 + +For this I bothered to write a simplistic item-item similarity metric to +take into account the additional info that is available: track, artist, +album, genre. The result was comparatively better: 17.92% error rate, good +enough for 4th place at the moment. + +Of course, the next task is to put this through the actual distributed +processing -- that's really the appropriate solution. + +This shows you can still tackle fairly impressive scale with a +non-distributed solution. These results suggest that the largest instances +available from EC2 would accommodate almost 1 billion ratings in memory. +However, at that scale, running a user's full recommendations would easily be +measured in seconds, not milliseconds. + +<a name="MahoutBenchmarks-Clustering"></a> +# Clustering + +See [MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588) + + http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/mahout-wiki.mdtext ---------------------------------------------------------------------- diff --git a/website/_pages/mahout-wiki.mdtext b/website/_pages/mahout-wiki.mdtext new file mode 100644 index 0000000..043ba17 --- /dev/null +++ b/website/_pages/mahout-wiki.mdtext @@ -0,0 +1,194 @@ +Title: Mahout Wiki +Apache Mahout is a new Apache TLP project to create scalable machine +learning algorithms under the Apache license. + +{toc:style=disc|minlevel=2} + +<a name="MahoutWiki-General"></a> +## General +[Overview](overview.html) + -- Mahout? What's that supposed to be? + +[Quickstart](quickstart.html) + -- learn how to quickly set up Apache Mahout for your project.
+ +[FAQ](faq.html) + -- Frequent questions encountered on the mailing lists. + +[Developer Resources](developer-resources.html) + -- overview of the Mahout development infrastructure. + +[How To Contribute](how-to-contribute.html) + -- get involved with the Mahout community. + +[How To Become A Committer](how-to-become-a-committer.html) + -- become a member of the Mahout development community. + +[Hadoop](http://hadoop.apache.org) + -- several of our implementations depend on Hadoop. + +[Machine Learning Open Source Software](http://mloss.org/software/) + -- other projects implementing Open Source Machine Learning libraries. + +[Mahout -- The name, history and its pronunciation](mahoutname.html) + +<a name="MahoutWiki-Community"></a> +## Community + +[Who we are](who-we-are.html) + -- who are the developers behind Apache Mahout? + +[Books, Tutorials, Talks, Articles, News, Background Reading, etc. on Mahout](books-tutorials-and-talks.html) + +[Issue Tracker](issue-tracker.html) + -- see what features people are working on, submit patches and file bugs. + +[Source Code (SVN)](https://svn.apache.org/repos/asf/mahout/) + -- [Fisheye](http://fisheye6.atlassian.com/browse/mahout) + -- download the Mahout source code from svn. + +[Mailing lists and IRC](mailing-lists,-irc-and-archives.html) + -- links to our mailing lists, IRC channel and archived design and +algorithm discussions; maybe your question was already answered there. + +[Version Control](version-control.html) + -- where we track our code. + +[Powered By Mahout](powered-by-mahout.html) + -- who is using Mahout in production? + +[Professional Support](professional-support.html) + -- who is offering professional support for Mahout? + +[Mahout and Google Summer of Code](gsoc.html) + -- All you need to know about Mahout and GSoC.
+ + +[Glossary of commonly used terms and abbreviations](glossary.html) + +<a name="MahoutWiki-Installation/Setup"></a> +## Installation/Setup + +[System Requirements](system-requirements.html) + -- what do you need to run Mahout? + +[Quickstart](quickstart.html) + -- get started with Mahout, run the examples and get pointers to further +resources. + +[Downloads](downloads.html) + -- a list of Mahout releases. + +[Download and installation](buildingmahout.html) + -- build Mahout from the sources. + +[Mahout on Amazon's EC2 Service](mahout-on-amazon-ec2.html) + -- run Mahout on Amazon's EC2. + +[Mahout on Amazon's EMR](mahout-on-elastic-mapreduce.html) + -- Run Mahout on Amazon's Elastic Map Reduce + +[Integrating Mahout into an Application](mahoutintegration.html) + -- integrate Mahout's capabilities in your application. + +<a name="MahoutWiki-Examples"></a> +## Examples + +1. [ASF Email Examples](asfemail.html) + -- Examples of recommenders, clustering and classification all using a +public domain collection of 7 million emails. + +<a name="MahoutWiki-ImplementationBackground"></a> +## Implementation Background + +<a name="MahoutWiki-RequirementsandDesign"></a> +### Requirements and Design + +[Matrix and Vector Needs](matrix-and-vector-needs.html) + -- requirements for Mahout vectors. + +[Collection(De-)Serialization](collection(de-)serialization.html) + +<a name="MahoutWiki-CollectionsandAlgorithms"></a> +### Collections and Algorithms + +Learn more about [mahout-collections](mahout-collections.html) +, containers for efficient storage of primitive-type data and open hash +tables. + +Learn more about the [Algorithms](algorithms.html) + discussed and employed by Mahout. + +Learn more about the [Mahout recommender implementation](recommender-documentation.html) +. + +<a name="MahoutWiki-Utilities"></a> +### Utilities + +This section describes tools that might be useful for working with Mahout. 
+ +[Converting Content](converting-content.html) + -- Mahout has some utilities for converting content such as logs to +formats more amenable for consumption by Mahout. +[Creating Vectors](creating-vectors.html) + -- Mahout's algorithms operate on vectors. Learn more on how to generate +these from raw data. +[Viewing Result](viewing-result.html) + -- How to visualize the result of your trained algorithms. + +<a name="MahoutWiki-Data"></a> +### Data + +[Collections](collections.html) + -- To try out and test Mahout's algorithms you need training data. We are +always looking for new training data collections. + +<a name="MahoutWiki-Benchmarks"></a> +### Benchmarks + +[Mahout Benchmarks](mahout-benchmarks.html) + +<a name="MahoutWiki-Committer'sResources"></a> +## Committer's Resources + +* [Testing](testing.html) + -- Information on test plans and ideas for testing + +<a name="MahoutWiki-ProjectResources"></a> +### Project Resources + +* [Dealing with Third Party Dependencies not in Maven](thirdparty-dependencies.html) +* [How To Update The Website](how-to-update-the-website.html) +* [Patch Check List](patch-check-list.html) +* [How To Release](http://cwiki.apache.org/confluence/display/MAHOUT/How+to+release) +* [Release Planning](release-planning.html) +* [Sonar Code Quality Analysis](https://analysis.apache.org/dashboard/index/63921) + +<a name="MahoutWiki-AdditionalResources"></a> +### Additional Resources + +* [Apache Machine Status](http://monitoring.apache.org/status/) + \- Check to see if SVN, other resources are available. +* [Committer's FAQ](http://www.apache.org/dev/committers.html) +* [Apache Dev](http://www.apache.org/dev/) + + +<a name="MahoutWiki-HowToEditThisWiki"></a> +## How To Edit This Wiki + +How to edit this Wiki + +This Wiki is a collaborative site, anyone can contribute and share: + +* Create an account by clicking the "Login" link at the top of any page, +and picking a username and password. 
+* Edit any page by pressing Edit at the top of the page + +There are some conventions used on the Mahout wiki: + + * {noformat}+*TODO:*+{noformat} (+*TODO:*+ ) is used to denote sections +that definitely need to be cleaned up. + * {noformat}+*Mahout_(version)*+{noformat} (+*Mahout_0.2*+) is used to +draw attention to which version of Mahout a feature was (or will be) added +to Mahout. + http://git-wip-us.apache.org/repos/asf/mahout/blob/54ef150e/website/_pages/mailing-lists.md ---------------------------------------------------------------------- diff --git a/website/_pages/mailing-lists.md b/website/_pages/mailing-lists.md new file mode 100644 index 0000000..3569ceb --- /dev/null +++ b/website/_pages/mailing-lists.md @@ -0,0 +1,73 @@ +--- +layout: mahout +title: Mailing Lists, IRC and Archives +permalink: /mailing-lists/ +--- +# General + +Communication at Mahout happens primarily online via mailing lists. We have +a user as well as a dev list for discussion. In addition there is a commit +list so we are able to monitor what happens on the wiki and in svn. + +<a name="MailingLists,IRCandArchives-Mailinglists"></a> +# Mailing lists + +<a name="MailingLists,IRCandArchives-MahoutUserList"></a> +## Mahout User List + +This list is for users of Mahout to ask questions, share knowledge, and +discuss issues. Do send mail to this list with usage and configuration +questions and problems. Also, please send questions to this list to verify +your problem before filing issues in JIRA. + +* [Subscribe](mailto:[email protected]) +* [Unsubscribe](mailto:[email protected]) + +<a name="MailingLists,IRCandArchives-MahoutDeveloperList"></a> +## Mahout Developer List + +This is the list where participating developers of the Mahout project meet +and discuss issues concerning Mahout internals, code changes/additions, +etc. Do not send mail to this list with usage questions or configuration +questions and problems. 
+ +Discussion list: + +* [Subscribe](mailto:[email protected]) + -- Do not send mail to this list with usage questions or configuration +questions and problems. +* [Unsubscribe](mailto:[email protected]) + +Commit notifications: + +* [Subscribe](mailto:[email protected]) +* [Unsubscribe](mailto:[email protected]) + +<a name="MailingLists,IRCandArchives-IRC"></a> +# IRC + +Mahout's IRC channel is **#mahout**. It is a logged channel. Please keep in +mind that it is for discussion purposes only and that (pseudo)decisions +should be brought back to the dev@ mailing list or JIRA and other people +who are not on IRC should be given time to respond before any work is +committed. + +<a name="MailingLists,IRCandArchives-Archives"></a> +# Archives + +<a name="MailingLists,IRCandArchives-OfficialApacheArchive"></a> +## Official Apache Archive + +* [http://mail-archives.apache.org/mod_mbox/mahout-dev/](http://mail-archives.apache.org/mod_mbox/mahout-dev/) +* [http://mail-archives.apache.org/mod_mbox/mahout-user/](http://mail-archives.apache.org/mod_mbox/mahout-user/) + +<a name="MailingLists,IRCandArchives-ExternalArchives"></a> +## External Archives + +* [MarkMail](http://mahout.markmail.org/) +* [Gmane](http://dir.gmane.org/gmane.comp.apache.mahout.user) + +Please note the inclusion of a link to an archive does not imply an +endorsement of that company by any of the committers of Mahout the Lucene +PMC or the Apache Software Foundation. Each archive owner is solely +responsible for the contents and availability of their archive.
