Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Synthetic Control Data (https://cwiki.apache.org/confluence/display/MAHOUT/Synthetic+Control+Data)
Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h1. Introduction

This example demonstrates clustering of control charts, each of which is a time series. [Control charts|http://en.wikipedia.org/wiki/Control_chart] are tools used to determine whether or not a manufacturing or business process is in a state of statistical control. The control charts used here are simulated at equal time intervals and are available from the UCI Machine Learning Repository. The data is described [here|http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html].

h1. Problem description

A set of control chart time series needs to be clustered into closely knit groups. The data set is synthetic: it is generated to resemble real-world data in an anonymized form. It contains six different classes (Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift). With these trends present in the input data set, the Mahout clustering algorithm will group the data into buckets corresponding to these classes. By the end of this example, you will know how to perform clustering using Mahout.

h1. Pre-Prep

Make sure you have the following covered before you work through the example.

# Input data set. Download it [here|http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data].
## Sample input data: the input consists of 600 rows and 60 columns. Rows 1 - 100 contain Normal data, rows 101 - 200 contain Cyclic data, and so on. More info [here|http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html]

||_time||_time+x||_time+2x..||_time+60x||
|28.7812|34.4632|31.3381|31.2834|
|24.8923|25.741|27.5532|32.8217|
.. ..
|35.5351|41.7067|39.1705|48.3964|38.6103|
|24.2104|41.7679|45.2228|43.7762|48.8175|
.. ..

h1. Steps

* Download the data at [http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series].
* In $MAHOUT_HOME/, build the Job file
** The same job is used for all examples, so this only needs to be done once
** mvn install
** The job will be generated in $MAHOUT_HOME/examples/target/ and its name will contain the $MAHOUT_VERSION number. For example, when using the Mahout 0.3 release, the job will be mahout-examples-0.3.job
* (Optional){footnote}This step should be skipped when using standalone Hadoop{footnote} Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs \-put <PATH TO DATA> testdata
* Run the Job: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job {footnote}Substitute in whichever Clustering Job you want here: KMeans, Canopy, etc. See subdirectories of $MAHOUT_HOME/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/.{footnote}
** For [canopy|Canopy Clustering]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
** For [kmeans|K-Means Clustering]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
** For [fuzzykmeans|Fuzzy K-Means]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
** For [dirichlet|Dirichlet Process Clustering]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
** For [meanshift|Mean Shift Clustering]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
* Get the data out of HDFS{footnote}See [HDFS Shell|http://hadoop.apache.org/core/docs/current/hdfs_shell.html]{footnote}{footnote}The output directory is cleared when a new run starts, so the results must be retrieved before a new run{footnote} and have a look{footnote}Dirichlet also prints data to console{footnote}
** All example jobs use _testdata_ as input and write their output to the directory _output_
** Use _bin/hadoop fs \-lsr output_ to view all outputs. Copy them all to your local machine and you can run the ClusterDumper on them.
*** Sequence files containing the original points in Vector form are in _output/data_
*** Computed clusters are contained in _output/clusters-i_
*** All result clustered points are placed into _output/clusteredPoints_

{display-footnotes}
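
Before putting the file into HDFS, it can be worth confirming that the download matches the layout described in Pre-Prep: 600 rows of 60 values, with 100 consecutive rows per class. The following is a minimal Python sketch of such a check and is not part of Mahout. The function names are hypothetical, and it assumes that the values in each row are separated by whitespace and that the classes appear in the file in the order listed in the problem description.

```python
# Hypothetical helpers (not part of Mahout) for sanity-checking the
# synthetic control data before uploading it to HDFS.

# Assumption: classes appear in the file in this order, 100 rows each.
CLASSES = [
    "Normal",
    "Cyclic",
    "Increasing trend",
    "Decreasing trend",
    "Upward shift",
    "Downward shift",
]
ROWS_PER_CLASS = 100
COLUMNS = 60


def parse_rows(lines):
    """Parse whitespace-separated numeric rows, skipping blank lines."""
    return [[float(v) for v in line.split()] for line in lines if line.strip()]


def label_for_row(index):
    """Return the expected class name for a zero-based row index."""
    return CLASSES[index // ROWS_PER_CLASS]


def check_shape(rows):
    """Raise ValueError unless the data has 600 rows of 60 values each."""
    expected_rows = ROWS_PER_CLASS * len(CLASSES)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        if len(row) != COLUMNS:
            raise ValueError(f"row {i + 1}: expected {COLUMNS} values, found {len(row)}")
```

To use it, open the downloaded file, pass the file object to parse_rows, and pass the result to check_shape. If no error is raised, label_for_row gives the class a row is expected to belong to, which is handy later when comparing against the clusters Mahout produces.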
