> We should talk about whether it isn't better to just serialize chunks of
> vectors instead.

That actually is a better option; I mentioned it as a preferable solution avenue. Just splice the rows. But that brings up the issue of data prep utils, so it would require more coding than adding a capability to VW.

On Mon, Dec 13, 2010 at 4:54 PM, Ted Dunning <[email protected]> wrote:

> We should talk about whether it isn't better to just serialize chunks of
> vectors instead.
>
> This would not require fancy footwork and perversion of the contracts that
> VectorWritable thinks it has now. It would also be bound to be a better
> match for the problem domain.
>
> So what about just using VW as it stands for vectors that are at most 100K
> elements and are keyed by row and left-most column? Why isn't that the
> better option?
>
> On Mon, Dec 13, 2010 at 4:41 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > Thank you, Jake.
> >
> > Yes, I will do that tonight. I did a brief assessment of that code before,
> > but not very thoroughly, which was why I thought it must be low-hanging
> > fruit; but Ted and you certainly should know better, so I need to
> > look at it a second time to be sure. Thank you, sir.
> >
> > -Dima
> >
> > On Mon, Dec 13, 2010 at 4:37 PM, Jake Mannix <[email protected]>
> > wrote:
> >
> > > Check the source for VectorWritable, I'm pretty sure it serializes
> > > in the order of the nonDefaultIterator(), which for SASVectors is in order,
> > > so while these are indeed non-optimal for random access and mutating
> > > operations, that is indeed the tradeoff you have to make when picking
> > > your vector impl.
> > >
> > > -jake
> > >
> > > On Mon, Dec 13, 2010 at 4:30 PM, Dmitriy Lyubimov <[email protected]>
> > > wrote:
> > >
> > > > Yes, it should be. I thought Ted implied VectorWritable does it only this
> > > > way and none other.
> > > >
> > > > If we can differentiate I'd rather do it.
> > > > Implying that if you save in one format (non-sequential), we'd support
> > > > it with the caveat that it's subpar in certain cases, whereas where you
> > > > want to format input sequentially, we'd eliminate the vector prebuffering
> > > > stage. Yes, that will work. Thank you, Jake.
> > > >
> > > > -d
> > > >
> > > > On Mon, Dec 13, 2010 at 4:26 PM, Jake Mannix <[email protected]>
> > > > wrote:
> > > >
> > > > > Dmitriy,
> > > > >
> > > > > You should be able to specify that your matrices be stored in
> > > > > SequentialAccessSparseVector format if you need to. This is
> > > > > almost always the right thing for HDFS-backed matrices, because
> > > > > HDFS is write-once, and SASVectors are optimized for read-only
> > > > > sequential access, which is your exact use case, right?
> > > > >
> > > > > -jake
> > > > >
> > > > > On Mon, Dec 13, 2010 at 4:21 PM, Dmitriy Lyubimov <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > I don't think sequentiality is a requirement in the case I am working
> > > > > > on. However, let me peek at the code first. I am guessing it is some
> > > > > > form of a near-perfect hash, in which case it may not be possible to
> > > > > > read it in parts at all. Which would be bad, indeed. I would then need
> > > > > > to find a completely alternative input format to handle my case.
> > > > > >
> > > > > > On Mon, Dec 13, 2010 at 4:01 PM, Ted Dunning <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > I don't think that sequentiality is part of the contract.
> > > > > > > RandomAccessSparseVectors are likely to
> > > > > > > produce disordered values when serialized, I think.
> > > > > > >
> > > > > > > On Mon, Dec 13, 2010 at 1:48 PM, Dmitriy Lyubimov <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I will have to look at the details of VectorWritable to make sure
> > > > > > > > all cases are covered (I have only taken a very brief look so far).
> > > > > > > > But as long as it is able to produce elements in order of increasing
> > > > > > > > index, the push technique will certainly work for most algorithms
> > > > > > > > (and in some cases, notably with SSVD, it would work even if it
> > > > > > > > produces the data in a non-sequential way).
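
For illustration only, here is a minimal sketch of the chunking idea Ted proposes in the thread: split each wide row into blocks of at most 100K columns and key each block by row and left-most column. It is written in plain Python rather than Mahout's Java. The names `chunk_row` and `reassemble`, and the choice of fixed-width, aligned column blocks, are assumptions made for this sketch; none of them are part of VectorWritable or any Mahout API.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical types for this sketch: a chunk is keyed by (row, left-most column)
# and holds (offset-within-chunk, value) pairs in increasing offset order.
ChunkKey = Tuple[int, int]
Chunk = Tuple[ChunkKey, List[Tuple[int, float]]]


def chunk_row(row: int, entries: Dict[int, float],
              max_width: int = 100_000) -> Iterator[Chunk]:
    """Split one sparse row into blocks spanning at most max_width columns."""
    blocks: Dict[int, List[Tuple[int, float]]] = {}
    for col in sorted(entries):
        # Left-most column of the aligned block that contains this column.
        left = (col // max_width) * max_width
        blocks.setdefault(left, []).append((col - left, entries[col]))
    for left in sorted(blocks):
        yield (row, left), blocks[left]


def reassemble(chunks: Iterable[Chunk]) -> Dict[int, Dict[int, float]]:
    """Rebuild full sparse rows from chunks, in whatever order they arrive."""
    rows: Dict[int, Dict[int, float]] = {}
    for (row, left), cells in chunks:
        target = rows.setdefault(row, {})
        for offset, value in cells:
            target[left + offset] = value
    return rows


if __name__ == "__main__":
    row_entries = {0: 1.0, 99_999: 2.0, 100_000: 3.0, 250_001: 4.0}
    for key, cells in chunk_row(7, row_entries):
        print(key, cells)
```

Because every chunk carries its own (row, left-most column) key, a consumer can process blocks independently and never has to buffer a whole row, which is the property the thread is after. The cost, as noted in the reply, is that the data prep utilities would have to emit rows in this spliced form.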
