Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Robin Anil
I think it changed after Jeff committed his code. It was there earlier.


On Mon, Apr 26, 2010 at 12:24 AM, Sean Owen  wrote:

> Where, though? I just deleted all the methods to try it, and every test
> passes.
>
> On Sun, Apr 25, 2010 at 7:51 PM, Robin Anil  wrote:
> > It's used in clustering to generate the clusterid -> point id mapping. It
> > is also going to be used in classification (by the end of this summer) to
> > keep class labels.
>


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Sean Owen
Where, though? I just deleted all the methods to try it, and every test passes.

On Sun, Apr 25, 2010 at 7:51 PM, Robin Anil  wrote:
> It's used in clustering to generate the clusterid -> point id mapping. It
> is also going to be used in classification (by the end of this summer) to
> keep class labels.


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Robin Anil
On Mon, Apr 26, 2010 at 12:17 AM, Sean Owen  wrote:

> I agree that it'd be good to kind of finalize the Vector stuff. I
> don't think it's reasonable for users to expect data output by 0.3 to
> be compatible with 0.4 though, so wouldn't worry about that.
>
> I think we're on the verge of wanting a proper serialization system
> like Avro for vectors here -- but not quite. About 3 flags describe
> any vector: denseness, sequential access-ness, and whether it has a
> name, if you want to unify that too. A simple byte of bit flags seems
> not so bad, if that's about as complex as this will ever get.
>
> What about label bindings, which I brought up earlier?
> Actually, I cannot find where labels are used except in tests. They're
> not serialized or cloned consistently. Are these used? Seems like the
> reason to package them together would be serialization but that's not
> it.
>
It's used in clustering to generate the clusterid -> point id mapping. It
is also going to be used in classification (by the end of this summer) to
keep class labels.

>
> On Sun, Apr 25, 2010 at 4:36 PM, Robin Anil  wrote:
> > Let more comments come in before tearing it down. This affects everything.
> > We *have to* get it right by the next release, not necessarily today or
> > tomorrow. Otherwise it would kind of kill things for all the 0.3 users.
> > Once it's fixed, we can provide a converter to the new representation.
>


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Sean Owen
I agree that it'd be good to kind of finalize the Vector stuff. I
don't think it's reasonable for users to expect data output by 0.3 to
be compatible with 0.4 though, so wouldn't worry about that.

I think we're on the verge of wanting a proper serialization system
like Avro for vectors here -- but not quite. About 3 flags describe
any vector: denseness, sequential access-ness, and whether it has a
name, if you want to unify that too. A simple byte of bit flags seems
not so bad, if that's about as complex as this will ever get.
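
A minimal sketch of the flag-byte idea, for illustration only. The class
name, the constant names, and the bit order are assumptions made for this
example; nothing in the thread fixes them.

```java
final class VectorFlagsSketch {
    // Hypothetical bit assignments; the order is an assumption.
    static final int DENSE = 1;            // bit 0: dense storage
    static final int SEQUENTIAL = 1 << 1;  // bit 1: sequential access
    static final int NAMED = 1 << 2;       // bit 2: a name is present

    // Packs the three properties into a single header byte.
    static byte encode(boolean dense, boolean sequential, boolean named) {
        int flags = 0;
        if (dense) { flags |= DENSE; }
        if (sequential) { flags |= SEQUENTIAL; }
        if (named) { flags |= NAMED; }
        return (byte) flags;
    }

    static boolean isDense(byte flags) { return (flags & DENSE) != 0; }
    static boolean isSequential(byte flags) { return (flags & SEQUENTIAL) != 0; }
    static boolean isNamed(byte flags) { return (flags & NAMED) != 0; }

    public static void main(String[] args) {
        byte flags = encode(false, true, true);
        System.out.println("dense=" + isDense(flags)
            + " sequential=" + isSequential(flags)
            + " named=" + isNamed(flags));
    }
}
```

Five bits stay free in the byte, which is roughly the headroom implied by
"if that's about as complex as this will ever get".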

What about label bindings, which I brought up earlier?
Actually, I cannot find where labels are used except in tests. They're
not serialized or cloned consistently. Are these used? Seems like the
reason to package them together would be serialization but that's not
it.

On Sun, Apr 25, 2010 at 4:36 PM, Robin Anil  wrote:
> Let more comments come in before tearing it down. This affects everything.
> We *have to* get it right by the next release, not necessarily today or
> tomorrow. Otherwise it would kind of kill things for all the 0.3 users.
> Once it's fixed, we can provide a converter to the new representation.


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Robin Anil
A Vector is simply either an array of doubles or an array of (int:double)
pairs, and that information, along with other metadata, is stored in a
MetadataWritable. Makes sense to me, assuming MetadataWritable allows us to
skip over it efficiently without deserializing.


On Sun, Apr 25, 2010 at 8:58 PM, Sean Owen  wrote:

> Yes, I think if we can convince ourselves that there won't be that
> many different possibilities for representing a vector, then a simple
> boolean might unify everything. This approach doesn't 'scale' but I
> don't know there are other representations we must have.
>
> The issue of named vectors is interesting. There's not really such a
> thing as an optional field in Hadoop serialization. You can fake it
> with a boolean but that starts to be messy.
>
> Messy might be necessary as vectors perhaps take on more metadata --
> though I can't envision much more. So perhaps it is right and proper
> to retain a second serialization format, in NamedVectorWritable, which
> is really the "vector with metadata" serializer versus
> VectorWritable's "pure vector" serializer.
>
> It has a logic to me. It gets rid of writing the class name which is
> indeed unpalatable.
>
> Thoughts before I go tearing through again?
>
Let more comments come in before tearing it down. This affects everything.
We *have to* get it right by the next release, not necessarily today or
tomorrow. Otherwise it would kind of kill things for all the 0.3 users.
Once it's fixed, we can provide a converter to the new representation.

Robin


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Sean Owen
Yes, I think if we can convince ourselves that there won't be that
many different possibilities for representing a vector, then a simple
boolean might unify everything. This approach doesn't 'scale' but I
don't know there are other representations we must have.

The issue of named vectors is interesting. There's not really such a
thing as an optional field in Hadoop serialization. You can fake it
with a boolean but that starts to be messy.
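
A minimal sketch of faking an optional name with a boolean, for
illustration only. It uses plain java.io DataInput/DataOutput (the
interfaces that Hadoop's Writable methods receive) so that it runs without
Hadoop on the classpath; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class OptionalNameSketch {
    // Writes a presence marker, then the name only if there is one.
    static void writeName(DataOutput out, String name) throws IOException {
        out.writeBoolean(name != null);
        if (name != null) {
            out.writeUTF(name);
        }
    }

    // Reads the marker first and consumes a name only if it was written.
    static String readName(DataInput in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    // Convenience wrappers so the round trip can be tried in memory.
    static byte[] toBytes(String name) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeName(new DataOutputStream(buffer), name);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String fromBytes(byte[] bytes) {
        try {
            return readName(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The messiness shows up once there is more than one optional field: every
reader has to check each marker in exactly the order the writer used.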

Messy might be necessary as vectors perhaps take on more metadata --
though I can't envision much more. So perhaps it is right and proper
to retain a second serialization format, in NamedVectorWritable, which
is really the "vector with metadata" serializer versus
VectorWritable's "pure vector" serializer.

It has a logic to me. It gets rid of writing the class name which is
indeed unpalatable.

Thoughts before I go tearing through again?


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Robin Anil
>
>
> - How about moving label bindings out to NamedVector?
> - How about similar restructuring of matrices?
>
I don't know what the correct choice is here. It depends on whether we
should keep a single written representation for all vectors on disk. If so,
there could be an optional field for the name.


> - And how about not writing
> "org.apache.mahout.math.RandomAccessSparseVectorWritable" whenever
> VectorWritable does its wrapping.. I think making the package name and
> "Writable" implicit is perhaps worth the loss of generality.
>
Agreed, if we fix the on-disk representation. The only things needed are a
bool that says whether the dimensions are stored in sorted order and another
bool that tells whether the vector is dense. That way, dense vectors and
sequential access vectors could be deserialized faster (when those
conditions hold). But we keep the same written format for all vectors, and
the algorithm explicitly says what format it wants the vector deserialized
into.
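
A rough sketch of one written format with two flags, where the caller
chooses the in-memory form, for illustration only. The byte layout (flag
byte, cardinality, entry count, then entries) and every name here are
assumptions for the example, not Mahout's actual format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

final class UnifiedVectorFormatSketch {
    static final int SORTED = 1;      // indices are strictly increasing
    static final int DENSE = 1 << 1;  // indices are exactly 0..n-1 and are omitted

    static byte[] write(int cardinality, int[] indices, double[] values) {
        boolean sorted = true;
        for (int i = 1; i < indices.length; i++) {
            if (indices[i] <= indices[i - 1]) { sorted = false; }
        }
        boolean dense = indices.length == cardinality;
        for (int i = 0; dense && i < indices.length; i++) {
            dense = indices[i] == i;
        }
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte((sorted ? SORTED : 0) | (dense ? DENSE : 0));
            out.writeInt(cardinality);
            out.writeInt(indices.length);
            for (int i = 0; i < indices.length; i++) {
                if (!dense) { out.writeInt(indices[i]); }
                out.writeDouble(values[i]);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The caller asks for a dense array regardless of how the data was written.
    static double[] readDense(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            boolean dense = (in.readByte() & DENSE) != 0;
            double[] result = new double[in.readInt()];
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int index = dense ? i : in.readInt();
                result[index] = in.readDouble();
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The caller asks for an index -> value map regardless of how the data was written.
    static Map<Integer, Double> readSparse(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            boolean dense = (in.readByte() & DENSE) != 0;
            in.readInt(); // cardinality is not needed for the map form
            int count = in.readInt();
            Map<Integer, Double> result = new TreeMap<Integer, Double>();
            for (int i = 0; i < count; i++) {
                int index = dense ? i : in.readInt();
                result.put(index, in.readDouble());
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The SORTED bit is only recorded here; a fuller reader could use it to fill
a sequential-access structure without re-sorting.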

Robin


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Sean Owen
PS let's see a patch to keep discussing, I'm seeing ideas on lots of
good topics here and want to take the opportunity to strike while the
iron is hot and continue overhauling this.

But things like making everything a named vector are sort of a step
backwards, to where we just agreed to move away from -- making name a
default part of all vectors.

I am also not sure it is practical to use only VectorWritable because
of the storage overhead, though it does in fact seem to offer the very
facility alluded to in talk of a 'facade' class? I think doing things
like writing optional data in Hadoop's basic serialization format is
not really possible. I saw attempts in the previous code which felt
fragile: read a string, if it's the class name, assume it is the name
of the vector class to deserialize, otherwise assume it's a vector
name... hmm.

So, are we on the same page about how this works now? In fact I would
expect to see implementations start to specialize to one particular
representation, if possible, to be more efficient.


On this topic, sort of:

- How about moving label bindings out to NamedVector?
- How about similar restructuring of matrices?
- And how about not writing
"org.apache.mahout.math.RandomAccessSparseVectorWritable" whenever
VectorWritable does its wrapping.. I think making the package name and
"Writable" implicit is perhaps worth the loss of generality.


Re: How to tackle Vector->NamedVector and back conversion

2010-04-24 Thread Sean Owen
NamedVectorWritable already extends VectorWritable, though honestly I
don't like that and kept it to minimize disruption.

Serialized vector formats aren't exactly "polymorphic". I can't read
an X vector with the code intended to deserialize something that
extends X. So, really, the Writables shouldn't inherit from one
another.

VectorWritable is like a meta-format. It writes the class name for
XVector, then serializes with class XVectorWritable. This works for
any vector, so is a good default choice as it will read/write any
vector.

However that's a serious storage overhead. Writing the class name with
every instance?  For example I don't use VectorWritable in my most
recent rewrite of co-occurrence based recommenders, and use the
Writable for my vector format directly, since it saved lots of I/O.
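
A small sketch that puts a number on the per-record cost, for illustration
only. It assumes the class name is written with DataOutput.writeUTF (the
thread does not say which call is used) and compares that with a
hypothetical one-byte type tag; the class and method names are made up for
the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class ClassNameOverheadSketch {
    // Number of bytes DataOutput.writeUTF emits for the given string.
    static int utfSize(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeUTF(s);
            return buffer.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Extra bytes per record compared with a single-byte type tag.
    static int extraBytesPerRecord(String className) {
        return utfSize(className) - 1;
    }

    public static void main(String[] args) {
        String name = "org.apache.mahout.math.RandomAccessSparseVectorWritable";
        System.out.println("class name costs " + utfSize(name) + " bytes per record");
        System.out.println("a 1-byte tag would save " + extraBytesPerRecord(name) + " of them");
    }
}
```

For short sparse vectors, a fixed cost of this size can be comparable to
the payload itself.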

So while VectorWritable exists as a nice default, I don't think it's a
great idea to use in practice. Its generality comes at a price.

Erm, now I've lost track. What was the question? Is it moot, or answered?



On Sat, Apr 24, 2010 at 8:12 PM, Ted Dunning  wrote:
> Put in other words, this would mean that there is either one or two output
> formats but most importantly only one input format that would always read
> NamedVectorWritables, possibly by inserting default names.  Due to
> inheritance, those objects would serve both purposes.
>
> That sounds good and simple.
>
> On Sat, Apr 24, 2010 at 11:31 AM, Robin Anil  wrote:
>
>> > For algorithms that are accepting arguments of a particular type, it might
>> > be reasonable to let NVW extend VW (I am not at all sure about the
>> > unintended consequences of this, but it sounds plausible).   Then all we
>> > need is a facade that exposes an NVW interface for a wrapped VectorWritable
>> > with some kind of default labels (say the indexes as strings).
>> >
>> Or the other way around. Let everything be a NamedVectorWritable. During
>> deserialization, use explicit methods to use or skip the name.
>


Re: How to tackle Vector->NamedVector and back conversion

2010-04-24 Thread Ted Dunning
Put in other words, this would mean that there is either one or two output
formats but most importantly only one input format that would always read
NamedVectorWritables, possibly by inserting default names.  Due to
inheritance, those objects would serve both purposes.

That sounds good and simple.

On Sat, Apr 24, 2010 at 11:31 AM, Robin Anil  wrote:

> > For algorithms that are accepting arguments of a particular type, it might
> > be reasonable to let NVW extend VW (I am not at all sure about the
> > unintended consequences of this, but it sounds plausible).   Then all we
> > need is a facade that exposes an NVW interface for a wrapped VectorWritable
> > with some kind of default labels (say the indexes as strings).
> >
> Or the other way around. Let everything be a NamedVectorWritable. During
> deserialization, use explicit methods to use or skip the name.


Re: How to tackle Vector->NamedVector and back conversion

2010-04-24 Thread Robin Anil
On Sat, Apr 24, 2010 at 11:50 PM, Ted Dunning  wrote:

> If we are talking about the Writable aspect of this, then whatever input
> format we use should reasonably be able to handle both kinds of data with
> the conversions as you suggest.
>
Yes. Having two separate Writable classes at the moment is what creates
this issue.

>
> For algorithms that are accepting arguments of a particular type, it might
> be reasonable to let NVW extend VW (I am not at all sure about the
> unintended consequences of this, but it sounds plausible).   Then all we
> need is a facade that exposes an NVW interface for a wrapped VectorWritable
> with some kind of default labels (say the indexes as strings).
>
Or the other way around. Let everything be a NamedVectorWritable. During
deserialization, use explicit methods to use or skip the name.


>
> On Sat, Apr 24, 2010 at 11:04 AM, Robin Anil  wrote:
>
> > Some algorithms are using NamedVectorWritable, some using VectorWritable.
> > Shouldn't we have an identity converter for forward conversion and some
> > form of name-assigning converter for backward conversion? Otherwise it's
> > going to be messy.
> >
> > Robin
> >
>


Re: How to tackle Vector->NamedVector and back conversion

2010-04-24 Thread Ted Dunning
If we are talking about the Writable aspect of this, then whatever input
format we use should reasonably be able to handle both kinds of data with
the conversions as you suggest.

For algorithms that are accepting arguments of a particular type, it might
be reasonable to let NVW extend VW (I am not at all sure about the
unintended consequences of this, but it sounds plausible).   Then all we
need is a facade that exposes an NVW interface for a wrapped VectorWritable
with some kind of default labels (say the indexes as strings).
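
A minimal sketch of the facade idea, for illustration only. It reads
"default labels (say the indexes as strings)" as naming each unnamed vector
after its record index, which is one possible interpretation. NamedView and
the method names are hypothetical, and a plain double[] stands in for the
wrapped vector.

```java
final class NamedFacadeSketch {
    // Hypothetical stand-in for what a named-vector consumer expects to see.
    interface NamedView {
        String getName();
        double[] getValues();
    }

    // A vector that really carries a name.
    static NamedView named(final String name, final double[] values) {
        return new NamedView() {
            public String getName() { return name; }
            public double[] getValues() { return values; }
        };
    }

    // Facade for an unnamed vector: the record index becomes the default name.
    static NamedView withDefaultName(long recordIndex, double[] values) {
        return named(Long.toString(recordIndex), values);
    }

    public static void main(String[] args) {
        NamedView plain = withDefaultName(7, new double[] {1.0, 2.0});
        NamedView labeled = named("doc-a", new double[] {3.0});
        System.out.println(plain.getName() + " " + labeled.getName());
    }
}
```

With this shape, code written against the named interface can consume both
kinds of input without knowing which one it was given.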

On Sat, Apr 24, 2010 at 11:04 AM, Robin Anil  wrote:

> Some algorithms are using NamedVectorWritable, some using VectorWritable.
> Shouldn't we have an identity converter for forward conversion and some
> form of name-assigning converter for backward conversion? Otherwise it's
> going to be messy.
>
> Robin
>