[ 
https://issues.apache.org/jira/browse/MAHOUT-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999450#comment-12999450
 ] 

Frank Scholten commented on MAHOUT-612:
---------------------------------------

Robin: I think I understand what you mean about serializing the config. At the
moment the mappers and reducers still need to access values in the
Configuration object via the config keys. Is it possible to turn the
(KMeans|Canopy)Configuration into a simple POJO, have it implement Writable,
serialize it inside the Configuration, and deserialize it at the mapper and
reducer? Or does this have performance implications or other consequences?

We could perhaps add a method to (KMeans|Canopy)Configuration:

public Configuration asConfiguration() { ... }

where it serializes itself inside a Configuration and then returns it.
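
A minimal sketch of that idea, for illustration only. To stay self-contained
it uses java.util.Properties as a stand-in for Hadoop's Configuration and
declares a local Writable interface with the same two methods as
org.apache.hadoop.io.Writable. KMeansSettings, its fields, and the key name are
hypothetical and not taken from the patch. Because Configuration values are
strings, the serialized bytes are Base64-encoded here, which is one option
among several.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.Properties;

// Local stand-in mirroring the two methods of org.apache.hadoop.io.Writable.
interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical configuration bean; the fields are illustrative assumptions.
class KMeansSettings implements Writable {
  // Hypothetical key under which the whole bean is stored.
  static final String KEY = "example.kmeans.settings";

  private String inputPath = "";
  private double convergenceDelta;
  private int maxIterations;

  KMeansSettings() {}

  KMeansSettings(String inputPath, double convergenceDelta, int maxIterations) {
    this.inputPath = inputPath;
    this.convergenceDelta = convergenceDelta;
    this.maxIterations = maxIterations;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(inputPath);
    out.writeDouble(convergenceDelta);
    out.writeInt(maxIterations);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Fields must be read in the same order they were written.
    inputPath = in.readUTF();
    convergenceDelta = in.readDouble();
    maxIterations = in.readInt();
  }

  // Driver side: serialize this bean into a single string-valued entry.
  // Properties stands in for Hadoop's Configuration in this sketch.
  Properties asConfiguration() {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      write(new DataOutputStream(bytes));
      Properties conf = new Properties();
      conf.setProperty(KEY, Base64.getEncoder().encodeToString(bytes.toByteArray()));
      return conf;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Mapper/reducer side: rebuild the bean from the stored entry.
  static KMeansSettings fromConfiguration(Properties conf) {
    try {
      byte[] bytes = Base64.getDecoder().decode(conf.getProperty(KEY));
      KMeansSettings settings = new KMeansSettings();
      settings.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return settings;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  String getInputPath() { return inputPath; }
  double getConvergenceDelta() { return convergenceDelta; }
  int getMaxIterations() { return maxIterations; }
}
```

In this sketch the driver would call asConfiguration() once before submitting
the job, and a mapper or reducer would call fromConfiguration(...) once during
its setup step. The decode then happens once per task rather than once per
record, so the overhead is likely small, though that is not measured here.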

> Simplify configuring and running Mahout MapReduce jobs from Java using Java 
> bean configuration
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-612
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-612
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Frank Scholten
>             Fix For: 0.5
>
>         Attachments: MAHOUT-612-canopy.patch, MAHOUT-612-v2.patch, 
> MAHOUT-612.patch
>
>
> Most of the Mahout features require running several jobs in sequence. This 
> can be done via the command line or using one of the driver classes.
> Running and configuring a Mahout job from Java requires either using the
> Driver's static methods or creating a String array of parameters and passing
> it to the main method of the job. If we can instead configure jobs through
> a Java bean or factory, it will be type safe and easier to use from DI
> frameworks such as Spring and Guice.
> I have added a patch where I factored out a KMeans MapReduce job plus a 
> configuration Java bean, from KMeansDriver.buildClustersMR(...)
> * The KMeansMapReduceConfiguration takes care of setting up the correct 
> values in the Hadoop Configuration object and initializes defaults. I copied 
> the config keys from KMeansConfigKeys.
> * The KMeansMapReduceJob contains the code for the actual algorithm running 
> all iterations of KMeans and returns the KMeansMapReduceConfiguration, which 
> contains the cluster path for the final iteration.
> I would like to extend this approach to other Hadoop jobs, for instance the
> job for creating points in KMeansDriver, but I first want some feedback on this.
> One of the benefits of this approach is that it becomes easier to chain jobs. 
> For instance we can chain Canopy to KMeans by connecting the output dir of 
> Canopy's configuration to the input dir of the configuration of the KMeans 
> job next in the chain. Hadoop's JobControl class can then be used to connect 
> and execute the entire chain.
> This approach can be further improved by turning the configuration bean into 
> a factory for creating MapReduce or sequential jobs. This would probably 
> remove some duplicated code in the KMeansDriver.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
