> I am not sure I would ever expect there to be a common format across all
> jobs. They just don't all operate on the same information. Even
> where two jobs ingest "vectors", it doesn't mean vectors for one are
> meaningful for another.
>

Machine learning has quite a few algorithms where data is processed in a way
foreign to its domain. Running SVD on user/item/preference matrices is a
great example: in terms of the domain, this makes no sense whatsoever.

This proposal in no way recommends a "common format" across all jobs. The
jobs all have their own I/O formats, and that would stay. Under this
proposal, you can ask a job to also emit its data in one of the common
formats. The semantics don't matter.

The best justification for this is FPGrowth. It emits a custom object,
TopKStringPatterns. If I am interested in processing only one aspect of its
full data structure, I cannot ask it to emit part of that structure. If,
say, I want to collate the graph of short patterns, I'm stuck. Without
writing a custom Java program, the output goes nowhere. (I'm really
interested to hear whether anybody has ever done anything with the output.)

Another, simpler use case is the system of making Vector files out of Lucene
analysis output. The Lanczos distributed solver takes matrices instead of
vectors. What if I want to run the Lucene vector file output through this
SVD? I have to somehow turn my (named) vectors into a (labeled) matrix.
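
To make the conversion concrete, here is a minimal sketch in plain Python (not
Mahout code). It assumes named vectors arrive as (name, sparse vector) pairs
and that the solver wants rows keyed by an integer index; the function name
and the data layout are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: a sparse vector is a mapping of column index -> value.
SparseVector = Dict[int, float]


def named_vectors_to_matrix(
    named: List[Tuple[str, SparseVector]],
) -> Tuple[List[Tuple[int, SparseVector]], List[str]]:
    """Turn (name, vector) pairs into integer-keyed rows plus a label list.

    The row id is the position of the vector in the input, and
    labels[row_id] recovers the original name afterwards.
    """
    rows: List[Tuple[int, SparseVector]] = []
    labels: List[str] = []
    for row_id, (name, vector) in enumerate(named):
        labels.append(name)
        rows.append((row_id, dict(vector)))  # copy so the input is not shared
    return rows, labels


# Hypothetical input: two named document vectors.
docs = [("doc-a", {0: 1.0, 3: 2.0}), ("doc-b", {1: 0.5})]
rows, labels = named_vectors_to_matrix(docs)
```

The label list is the piece that gets lost today: once rows are keyed by
integer, something has to carry the names along so the SVD output can be
mapped back to the original vectors.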

> If you spot cases where two jobs really ingest the same input, and do
> not have the same input format, they could surely be unified. But
> that's better tackled by identifying the case(s), make a JIRA, and
> patch it.
>
> For generalization I think this is a bridge too far. Another layer of
> options and metadata specifying what sub-types can be imported and
> exported with what caveats, etc.? The jobs aren't even consistent in
> the version of Hadoop they use -- we have some still on 0.19.x.
>

Does that require a different SequenceFile? Eep.


>
> On Mon, Sep 12, 2011 at 4:14 AM, Lance Norskog <goks...@gmail.com> wrote:
> >
> > https://cwiki.apache.org/confluence/display/MAHOUT/Import+Export+Sequence+File+Formats
> >
> > Please have a look; comment or rewrite as you please. It's a wish list of
> > what I would want, approaching Mahout either as an experienced user or as a
> > newbie.
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
> >
>



-- 
Lance Norskog
goks...@gmail.com
