On Thu, Oct 27, 2011 at 5:55 AM, Dan Brickley <[email protected]> wrote:
> On 27 October 2011 12:54, Frank Scholten <[email protected]> wrote:
> > On Thu, Oct 27, 2011 at 9:27 AM, Ted Dunning <[email protected]> wrote:
> >
> >>> There's some talk about beanifying our workflow steps in
> >>> https://issues.apache.org/jira/browse/MAHOUT-612, but I can't say I
> >>> understand how this would allow us to reach the composable workflow
> >>> goal.
> >>
> >> I don't think it does. It just passes data around in files like we do now.
> >>
> >
> > Yes, MAHOUT-612 has beans for configuring kmeans, canopy and
> > seq2sparse, and the configuration is serialized and deserialized at the
> > mappers and reducers. But it does not have a workflow engine or
> > anything like that; you have to connect the inputs of one job to the
> > outputs of the other yourself.
>
> Any sense for whether having all the inputs/outputs written to HDFS is
> a big problem, versus trying to plug things together in code so that
> it doesn't all get serialized? I mean, how much is it worth trying to
> do the latter instead of using the filesystem for integration?
> The main place I see this cost hurting is where you have feature
> extraction that could be folded into a map phase.
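
To make the "connect inputs to outputs" point concrete, here is a minimal
sketch of the file-based chaining being discussed: canopy writes its
centroids to an HDFS directory, and k-means is pointed at that directory as
its seed clusters, so the "workflow" is just matching paths. The driver
signatures below are from the 0.5-era API and are illustrative only; exact
parameters and the "clusters-0" subdirectory name vary between Mahout
versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class ChainedClustering {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Vectors produced by an earlier step (e.g. seq2sparse) already sit on HDFS.
    Path vectors = new Path("output/vectors/tfidf-vectors");
    Path canopyOut = new Path("output/canopy");
    Path kmeansOut = new Path("output/kmeans");

    // Step 1: canopy clustering; its centroids are written under canopyOut.
    CanopyDriver.run(conf, vectors, canopyOut,
        new EuclideanDistanceMeasure(), 500.0, 250.0, false, false);

    // Step 2: k-means reads those centroids back from HDFS as its seed
    // clusters. The subdirectory name ("clusters-0") is version-dependent.
    KMeansDriver.run(conf, vectors, new Path(canopyOut, "clusters-0"),
        kmeansOut, new EuclideanDistanceMeasure(), 0.01, 10, true, false);
  }
}

Every intermediate result here round-trips through the filesystem, which is
exactly the cost being weighed against plugging the steps together in code.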
