We should talk about whether it isn't better to just serialize chunks of
vectors instead.

This would not require fancy footwork and perversion of the contracts that
VectorWritable thinks it has now.  It would also be bound to be a better
match for the problem domain.

So what about just using VW as it stands for vectors that are at most 100K
elements and are keyed by row and left-most column?  Why isn't that the
better option?
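
To make the chunking idea concrete, here is a rough sketch of how one row
could be cut into pieces keyed by row and left-most column. Everything in it
is hypothetical: the names ChunkedRows, Chunk and split are made up for
illustration, and plain double arrays stand in for Mahout vectors. It is not
existing Mahout code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names exist in Mahout.
final class ChunkedRows {

    // One chunk of a row: the key is (row, leftColumn); values holds the
    // entries for columns leftColumn .. leftColumn + values.length - 1.
    static final class Chunk {
        final int row;
        final int leftColumn;
        final double[] values;

        Chunk(int row, int leftColumn, double[] values) {
            this.row = row;
            this.leftColumn = leftColumn;
            this.values = values;
        }
    }

    // Split one dense row into consecutive chunks of at most maxChunkSize
    // elements (for example 100000, as suggested above).
    static List<Chunk> split(int row, double[] rowValues, int maxChunkSize) {
        if (maxChunkSize <= 0) {
            throw new IllegalArgumentException("maxChunkSize must be positive");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (int left = 0; left < rowValues.length; left += maxChunkSize) {
            int right = Math.min(left + maxChunkSize, rowValues.length);
            chunks.add(new Chunk(row, left, Arrays.copyOfRange(rowValues, left, right)));
        }
        return chunks;
    }
}
```

Each chunk is small enough to hold in memory, so presumably its values could
be written with VW unchanged while the (row, left-most column) pair serves as
the record key. A reader that needs the whole row would then consume the
chunks for that row in order of left-most column.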

On Mon, Dec 13, 2010 at 4:41 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Thank you, Jake.
>
> Yes, I will do that tonight. I did a brief assessment of that code before,
> but not a thorough one, which is why I thought it must be low-hanging
> fruit. Ted and you certainly know it better, so I need to look at it a
> second time to be sure. Thank you, sir.
>
> -Dima
>
> On Mon, Dec 13, 2010 at 4:37 PM, Jake Mannix <[email protected]>
> wrote:
>
> > Check the source for VectorWritable, I'm pretty sure it serializes
> > in the order of the nonDefaultIterator(), which for SASVectors is in
> > order, so while these are indeed non-optimal for random access and
> > mutating operations, that is indeed the tradeoff you have to make when
> > picking your vector impl.
> >
> >  -jake
> >
> > On Mon, Dec 13, 2010 at 4:30 PM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> > > Yes, it should be. I thought Ted implied VectorWritable does it only
> > > this way and no other.
> > >
> > > If we can differentiate, I'd rather do it. That implies that if you save
> > > in one format (non-sequential), we'd support it with the caveat that it
> > > is subpar in certain cases, whereas if you format the input
> > > sequentially, we'd eliminate the vector prebuffering stage. Yes, that
> > > will work. Thank you, Jake.
> > >
> > > -d
> > >
> > >
> > > On Mon, Dec 13, 2010 at 4:26 PM, Jake Mannix <[email protected]>
> > > wrote:
> > >
> > > > Dmitriy,
> > > >
> > > >  You should be able to specify that your matrices be stored in
> > > > SequentialAccessSparseVector format if you need to.  This is
> > > > almost always the right thing for HDFS-backed matrices, because
> > > > HDFS is write-once, and SASVectors are optimized for read-only
> > > > sequential access, which is your exact use case, right?
> > > >
> > > >  -jake
> > > >
> > > > On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <[email protected]>
> > > > wrote:
> > > >
> > > > > I don't think sequentiality is a requirement in the case I am
> > > > > working on. However, let me peek at the code first. I am guessing
> > > > > it is some form of a near-perfect hash, in which case it may not be
> > > > > possible to read it in parts at all, which would be bad indeed. I
> > > > > would then need to find a completely different input format to
> > > > > handle my case.
> > > > >
> > > > > On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > I don't think that sequentiality is part of the contract.
> > > > > > RandomAccessSparseVectors are likely to produce disordered values
> > > > > > when serialized, I think.
> > > > > >
> > > > > > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > I will have to look at the details of VectorWritable to make sure
> > > > > > > all cases are covered (I have only taken a very brief look so
> > > > > > > far). But as long as it is able to produce elements in order of
> > > > > > > increasing index, the push technique will certainly work for most
> > > > > > > algorithms (and in some cases, notably with SSVD, it would work
> > > > > > > even if it produces the data in a non-sequential way).
> > > > > > >
