Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Synthetic Control Data 
(https://cwiki.apache.org/confluence/display/MAHOUT/Synthetic+Control+Data)


Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h1. Introduction

This example demonstrates clustering of time series data in the form of control 
charts. [Control charts |http://en.wikipedia.org/wiki/Control_chart] are tools 
used to determine whether or not a manufacturing or business process is in a 
state of statistical control. The control charts used here are synthetically 
generated over equal time intervals and are available from the UCI machine 
learning database. The data is described 
[here |http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html].

h1. Problem description

A set of control chart time series needs to be clustered into close-knit 
groups. The data set we use is synthetic and is meant to resemble real-world 
information in an anonymized format. It contains six different classes (Normal, 
Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift). Given 
these trends in the input data set, the Mahout clustering algorithm should 
group the series into buckets corresponding to their classes. By the end of 
this example, you will have learned how to perform clustering using Mahout. 
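
To make the idea of grouping similar series concrete, here is a minimal k-means sketch in plain Python. It is not Mahout's implementation and does not use the Mahout API. The function names and the four short toy series are made up for illustration, and the initial centroids are picked by hand.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in centroids]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy series (illustrative only, not taken from the UCI data set):
# two roughly flat series and two increasing ones.
series = [
    [30.0, 30.5, 29.5, 30.0],
    [29.0, 30.0, 31.0, 30.0],
    [30.0, 35.0, 40.0, 45.0],
    [31.0, 36.0, 41.0, 46.0],
]
labels, centers = kmeans(series, centroids=[series[0], series[2]])
print(labels)
```

Each series is treated as a point in a space with one dimension per time step, which is the same view the clustering jobs take of the 60-column input rows.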

h1. Pre-Prep
Make sure you have the following covered before you work through the example.
# Input data set. Download it [here | 
http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data].
 
## Sample input data: 
Input consists of 600 rows and 60 columns. Rows 1 - 100 contain Normal data, 
rows 101 - 200 contain Cyclic data, and so on. More info 
[here |http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html]
||_time||_time+x||_time+2x..||_time+60x||
|28.7812|34.4632|31.3381|31.2834|
|24.8923|25.741|27.5532|32.8217|
..
..
|35.5351|41.7067|39.1705|48.3964|38.6103|
|24.2104|41.7679|45.2228|43.7762|48.8175|
..
..
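
The row layout described above can be sketched as a small Python helper that parses one input row and maps a row number to its class. Two assumptions are made here: that values in the data file are separated by whitespace, and that the remaining four classes follow in the order listed in the problem description (the page only states the ranges for Normal and Cyclic explicitly). The names _parse_row_ and _class_for_row_ are hypothetical.

```python
# Class order is an assumption beyond the first two entries.
CLASSES = [
    "Normal",
    "Cyclic",
    "Increasing trend",
    "Decreasing trend",
    "Upward shift",
    "Downward shift",
]
ROWS_PER_CLASS = 100  # 600 rows / 6 classes


def parse_row(line):
    """Turn one whitespace-separated line of the data file into a list of floats."""
    return [float(value) for value in line.split()]


def class_for_row(row_number):
    """Return the class name for a 1-based row number (1..600)."""
    total = len(CLASSES) * ROWS_PER_CLASS
    if not 1 <= row_number <= total:
        raise ValueError(f"row_number must be between 1 and {total}")
    return CLASSES[(row_number - 1) // ROWS_PER_CLASS]


# First four values of the first sample row shown in the table above.
print(parse_row("28.7812 34.4632 31.3381 31.2834"))
print(class_for_row(1), class_for_row(101), class_for_row(600))
```

A mapping like this is only useful for checking clustering results afterwards; the clustering jobs themselves do not use the class labels.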


h1. Steps

* Download the data at 
[http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series].
* In $MAHOUT_HOME/, build the Job file
** The same job is used for all examples so this only needs to be done once
** mvn install
** The job will be generated in $MAHOUT_HOME/examples/target/ and its name 
will contain the $MAHOUT_VERSION number. For example, when using the Mahout 0.3 
release, the job will be mahout-examples-0.3.job
* (Optional){footnote}This step should be skipped when using standalone 
Hadoop{footnote} Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs \-put <PATH TO DATA> testdata
* Run the Job: $HADOOP_HOME/bin/hadoop jar 
$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job  
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job {footnote}Substitute 
in whichever Clustering Job you want here: KMeans, Canopy, etc. See 
subdirectories of 
$MAHOUT_HOME/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/.{footnote}
** For [canopy |Canopy Clustering]:  $HADOOP_HOME/bin/hadoop jar  
$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job  
org.apache.mahout.clustering.syntheticcontrol.canopy.Job
** For [kmeans |K-Means Clustering]:  $HADOOP_HOME/bin/hadoop jar  
$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job 
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
** For [fuzzykmeans |Fuzzy K-Means]:  $HADOOP_HOME/bin/hadoop jar  
$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job 
org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
** For [dirichlet |Dirichlet Process Clustering]: $HADOOP_HOME/bin/hadoop jar  
$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job 
org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
** For [meanshift |Mean Shift Clustering]: $HADOOP_HOME/bin/hadoop jar  
$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job 
org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
* Get the data out of HDFS{footnote}See [HDFS Shell | 
http://hadoop.apache.org/core/docs/current/hdfs_shell.html]{footnote}{footnote}The
 output directory is cleared when a new run starts so the results must be 
retrieved before a new run{footnote} and have a look{footnote}Dirichlet also 
prints data to console{footnote}
** All example jobs use _testdata_ as input and output to directory _output_
** Use _bin/hadoop fs \-lsr output_ to view all outputs. Copy them all to your 
local machine and you can run the ClusterDumper on them.
*** Sequence files containing the original points in Vector form are in 
_output/data_
*** Computed clusters are contained in _output/clusters-i_
*** All resulting clustered points are placed into _output/clusteredPoints_
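
The job commands in the steps above all follow one pattern, which can be sketched as a small Python helper that assembles the argument list for a chosen algorithm. The function name _build_job_command_ and the paths _/opt/hadoop_ and _/opt/mahout_ are hypothetical; only the command layout and the Job class names come from this page. The sketch builds and prints the command and does not run Hadoop.

```python
import shlex

# Sub-packages of org.apache.mahout.clustering.syntheticcontrol listed in the steps.
ALGORITHMS = ("canopy", "kmeans", "fuzzykmeans", "dirichlet", "meanshift")


def build_job_command(algorithm, hadoop_home, mahout_home, mahout_version):
    """Return the 'hadoop jar' argument list for one synthetic control example job."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    return [
        f"{hadoop_home}/bin/hadoop",
        "jar",
        f"{mahout_home}/examples/target/mahout-examples-{mahout_version}.job",
        f"org.apache.mahout.clustering.syntheticcontrol.{algorithm}.Job",
    ]


# Hypothetical install locations, substitute your own $HADOOP_HOME and $MAHOUT_HOME.
cmd = build_job_command("kmeans", "/opt/hadoop", "/opt/mahout", "0.3")
print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a single string means the result can be passed to _subprocess.run_ without shell quoting concerns, should you want to launch the job from a script.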

{display-footnotes}
