Repository: mahout
Updated Branches:
  refs/heads/master 2367fde82 -> 1199e5117
[MAHOUT-1956] Update README.md to include build instructions and an example for GPU/GPU/JVM

Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/9eb68e53
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/9eb68e53
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/9eb68e53

Branch: refs/heads/master
Commit: 9eb68e5319ad6f7692779e566206a159f98be8b9
Parents: 0496194
Author: Andrew Palumbo <[email protected]>
Authored: Sun Mar 19 14:33:04 2017 -0700
Committer: Andrew Palumbo <[email protected]>
Committed: Sun Mar 19 14:33:04 2017 -0700

----------------------------------------------------------------------
 README.md | 208 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 196 insertions(+), 12 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/mahout/blob/9eb68e53/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 1b058ae..c51db40 100644
--- a/README.md
+++ b/README.md
@@ -30,11 +30,6 @@ export MAHOUT_LOCAL=true # for running standalone on your dev machine,
 ```
 You will need a `$JAVA_HOME`, and if you are running on Spark, you will also need `$SPARK_HOME`
 
-Note when running the spark-shell job it can help to set some JVM options so you don't run out of memory:
-```
-$MAHOUT_OPTS="-Xmx6g -XX:MaxPermSize=512m" mahout spark-shell
-```
-
 ####Using Mahout as a Library
 Running any application that uses Mahout will require installing a binary or source version and setting the environment.
 To compile from source:
@@ -61,25 +56,214 @@ and a dependency for back end engine translation, e.g:
   <version>${mahout.version}</version>
 </dependency>
 ```
+####Building From Source
+
+######Prerequisites:
+
+Linux Environment (preferably Ubuntu 16.04.x) Note: Currently only the JVM-only build will work on a Mac.
+gcc > 4.x
+NVIDIA Card (installed with OpenCL drivers alongside usual GPU drivers)
+
+######Downloads
+
+Install java 1.7+ in an easily accessible directory (for this example, ~/java/)
+http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
+
+Create a directory ~/apache/ .
+
+Download apache Maven 3.3.9 and un-tar/gunzip to ~/apache/apache-maven-3.3.9/ .
+https://maven.apache.org/download.cgi
+
+Download and un-tar/gunzip Hadoop 2.4.1 to ~/apache/hadoop-2.4.1/ .
+https://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/
+
+Download and un-tar/gunzip spark-1.6.3-bin-hadoop2.4 to ~/apache/ .
+http://spark.apache.org/downloads.html
+Choose release: Spark-1.6.3 (Nov 07 2016)
+Choose package type: Pre-Built for Hadoop 2.4
+
+Install ViennaCL 1.7.0+
+If running Ubuntu 16.04+
+
 ```
-<dependency>
-  <groupId>org.apache.mahout</groupId>
-  <artifactId>mahout-flink_2.10</artifactId>
-  <version>${mahout.version}</version>
-</dependency>
+sudo apt-get install libviennacl-dev
+```
+
+Otherwise, if your distribution's package manager does not have a viennacl-dev package >1.7.0, clone it directly into the directory which will be included when Mahout is compiled:
+
+```
+mkdir ~/tmp
+cd ~/tmp && git clone https://github.com/viennacl/viennacl-dev.git
+cp -r viennacl-dev/viennacl/ /usr/local/
+cp -r viennacl-dev/CL/ /usr/local/
+```
+
+Ensure that the OpenCL 1.2+ drivers are installed (packaged with most consumer grade NVIDIA drivers). Not sure about higher end cards.
+
+Clone mahout repository into `~/apache`.
+
 ```
+git clone https://github.com/apache/mahout.git
+```
+
+######CONFIGURATION
+
+When building mahout for a spark backend, we need four System Environment variables set:
+```
+ export MAHOUT_HOME=/home/<user>/apache/mahout
+ export HADOOP_HOME=/home/<user>/apache/hadoop-2.4.1
+ export SPARK_HOME=/home/<user>/apache/spark-1.6.3-bin-hadoop2.4
+ export JAVA_HOME=/home/<user>/java/jdk-1.8.121
+```
+
+Mahout on Spark regularly uses one more environment variable, `MASTER`, which holds the address of the Spark cluster's master node (usually the node one is logged into).
+
+To use 4 local cores (Spark master need not be running)
+```
+export MASTER=local[4]
+```
+To use all available local cores (again, Spark master need not be running)
+```
+export MASTER=local[*]
+```
+To point to a cluster with spark running:
+```
+export MASTER=spark://master.ip.address:7077
+```
+
+We then add these to the path:
+
+```
+ PATH=$PATH:$MAHOUT_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$JAVA_HOME/bin
+```
+
+These should be added to your ~/.bashrc file.
+
+
+######BUILDING MAHOUT WITH APACHE MAVEN
+
+Currently Mahout has 3 builds. From the $MAHOUT_HOME directory we may issue the commands to build each using mvn profiles.
+
+JVM only:
+```
+mvn clean install -DskipTests
+```
+
+JVM with native OpenMP level 2 and level 3 matrix/vector Multiplication
+```
+mvn clean install -Pviennacl-omp -Phadoop2 -DskipTests
+```
+JVM with native OpenMP and OpenCL for Level 2 and level 3 matrix/vector Multiplication. (GPU errors fall back to OpenMP, currently only a single GPU/node is supported).
+```
+mvn clean install -Pviennacl -Phadoop2 -DskipTests
+```
+
+####TESTING THE MAHOUT ENVIRONMENT
+
+Mahout provides an extension to the spark-shell, which is good for getting to know the language, testing partition loads, prototyping algorithms, etc.
+
+To launch the shell in local mode with 2 threads, simply do the following:
+```
+$ MASTER=local[2] mahout spark-shell
+```
+
+After a very verbose startup, a Mahout welcome screen will appear:
+
+```
+Loading /home/andy/sandbox/apache-mahout-distribution-0.13.0/bin/load-shell.scala...
+import org.apache.mahout.math._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@3ca1f0a4
+
+                _                 _
+_ __ ___   __ _| |__   ___  _   _| |_
+ '_ ` _ \ / _` | '_ \ / _ \| | | | __|
+ | | | | (_| | | | | (_) | |_| | |_
+_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0
+
+
+That file does not exist
+
+
+scala>
+
+At the scala> prompt, enter:
+
+scala> :load home/<andy>/apache/mahout/examples
+ /bin/SparseSparseDrmTimer.mscala
+
+Which will load a matrix multiplication timer function definition. To run the matrix timer:
+
+    scala> timeSparseDRMMMul(1000,1000,1000,1,.02,1234L)
+    {...} res3: Long = 16321
+
+We can see that the JVM only version is rather slow, thus our motive for GPU and Native Multithreading support.
+
+To get an idea of what's going on under the hood of the timer, we may examine the .mscala (mahout scala) code which is both fully functional scala and the Mahout R-Like DSL for tensor algebra:
+
+
+
+
+def timeSparseDRMMMul(m: Int, n: Int, s: Int, para: Int, pctDense: Double = .20, seed: Long = 1234L): Long = {
+  val drmA = drmParallelizeEmpty(m , s, para).mapBlock(){
+    case (keys,block:Matrix) =>
+      val R = scala.util.Random
+      R.setSeed(seed)
+      val blockB = new SparseRowMatrix(block.nrow, block.ncol)
+      blockB := {x => if (R.nextDouble > pctDense) R.nextDouble else x }
+      (keys -> blockB)
+  }
+
+  val drmB = drmParallelizeEmpty(s , n, para).mapBlock(){
+    case (keys,block:Matrix) =>
+      val R = scala.util.Random
+      R.setSeed(seed + 1)
+      val blockB = new SparseRowMatrix(block.nrow, block.ncol)
+      blockB := {x => if (R.nextDouble > pctDense) R.nextDouble else x }
+      (keys -> blockB)
+  }
+
+  var time = System.currentTimeMillis()
+
+  val drmC = drmA %*% drmB
+
+  // trigger computation
+  drmC.numRows()
+
+  time = System.currentTimeMillis() - time
+
+  time
+
+}
+```
+
+For more information please see the following references:
+
+http://mahout.apache.org/users/environment/in-core-reference.html
+http://mahout.apache.org/users/environment/out-of-core-reference.html
+http://mahout.apache.org/users/sparkbindings/play-with-shell.html
+http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
+
+
+
 Note that due to an intermittent out-of-memory bug in a Flink test we have disabled it from the binary releases. To use Flink please uncomment the line in the root pom.xml in the `<modules>` block so it reads `<module>flink</module>`.
-####Examples
+####EXAMPLES
 For examples of how to use Mahout, see the examples directory located in `examples/bin`
 
 For information on how to contribute, visit the [How to Contribute Page](https://mahout.apache.org/developers/how-to-contribute.html)
 
-####Legal
+####LEGAL
 Please see the `NOTICE.txt` included in this directory for more information.
 
 [](https://travis-ci.org/apache/mahout)
+<!--
 [](https://coveralls.io/github/apache/mahout?branch=master)
+-->
\ No newline at end of file
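
Editor's note, not part of the commit: the CONFIGURATION steps added to the README in this diff can be collected into a single `~/.bashrc` fragment. The following is a minimal sketch under stated assumptions. It assumes the install locations from the README's example (`~/apache/...`). The `MAHOUT_BASE` helper variable and the fallback `JAVA_HOME` directory name are placeholders introduced here for illustration and do not come from Mahout.

```shell
# Sketch of a ~/.bashrc fragment for a Mahout-on-Spark build environment.
# Assumption: everything was unpacked under ~/apache, as in the README example.
MAHOUT_BASE="$HOME/apache"   # hypothetical helper variable, not read by Mahout

export MAHOUT_HOME="$MAHOUT_BASE/mahout"
export HADOOP_HOME="$MAHOUT_BASE/hadoop-2.4.1"
export SPARK_HOME="$MAHOUT_BASE/spark-1.6.3-bin-hadoop2.4"
# Keep an existing JAVA_HOME; otherwise fall back to a placeholder path
# that must be changed to the actual JDK directory.
export JAVA_HOME="${JAVA_HOME:-$HOME/java/jdk}"

# Run Spark locally on all cores; use spark://<host>:7077 for a cluster.
export MASTER="local[*]"

# Append each tool's bin directory to PATH.
export PATH="$PATH:$MAHOUT_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$JAVA_HOME/bin"

# Warn (on stderr) about any directory that does not exist yet.
for dir in "$MAHOUT_HOME" "$HADOOP_HOME" "$SPARK_HOME" "$JAVA_HOME"; do
  [ -d "$dir" ] || echo "warning: $dir does not exist" >&2
done
```

Quoting `local[*]` keeps the shell from treating the brackets as a glob pattern, and the final loop only warns, so sourcing the fragment before the downloads are complete does not abort the login shell.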
