On Thu, Oct 27, 2011 at 5:55 AM, Dan Brickley <[email protected]> wrote:
> On 27 October 2011 12:54, Frank Scholten <[email protected]> wrote:
> > On Thu, Oct 27, 2011 at 9:27 AM, Ted Dunning <[email protected]> wrote:
> >
> >>> There's some talk about beanifying our workflow steps in
> >>> https://issues.apache.org/jira/browse/MAHOUT-612, but I can't say I
> >>> understand how this would allow us to reach the composable workflow
> >>> goal.
> >>
> >> I don't think it does. It just passes data around in files like we do now.
> >>
> >
> > Yes, MAHOUT-612 has beans for configuring kmeans, canopy and
> > seq2sparse, and the configuration is serialized and deserialized at the
> > mappers and reducers. But it does not have a workflow engine or
> > anything like that; you have to connect the inputs of one job to the
> > outputs of the other yourself.
>
> Any sense for whether having all the inputs/outputs written to HDFS is
> a big problem, versus trying to plug things together in code so that
> it doesn't all get serialized? I mean, how much is it worth trying to
> do the latter instead of using the filesystem for integration?
> The main place I see this cost hurting is where you have feature
> extraction that could be folded into a map phase.
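
To make the "connect inputs to outputs" point concrete, here is a minimal
sketch of the file-based chaining being discussed: canopy writes its
centroids to an HDFS directory, and k-means is pointed at that directory as
its seed clusters, so the "workflow" is just matching paths. The driver
signatures below are from the 0.5-era API and are illustrative only; exact
parameters and the "clusters-0" subdirectory name vary between Mahout
versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class ChainedClustering {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Vectors produced by an earlier step (e.g. seq2sparse) already sit on HDFS.
    Path vectors = new Path("output/vectors/tfidf-vectors");
    Path canopyOut = new Path("output/canopy");
    Path kmeansOut = new Path("output/kmeans");

    // Step 1: canopy clustering; its centroids are written under canopyOut.
    CanopyDriver.run(conf, vectors, canopyOut,
        new EuclideanDistanceMeasure(), 500.0, 250.0, false, false);

    // Step 2: k-means reads those centroids back from HDFS as its seed
    // clusters. The subdirectory name ("clusters-0") is version-dependent.
    KMeansDriver.run(conf, vectors, new Path(canopyOut, "clusters-0"),
        kmeansOut, new EuclideanDistanceMeasure(), 0.01, 10, true, false);
  }
}

Every intermediate result here round-trips through the filesystem, which is
exactly the cost being weighed against plugging the steps together in code.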
