It really depends on your access patterns.

Blob storage of rows will be much faster for scans and will take much less
space.
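
To make the space argument concrete, here is a minimal sketch, not the actual HBase client API. It assumes dense rows of doubles, and the names `encode_row`, `decode_row`, `cell_layout_bytes`, the `"m"` family and the `"row-0001"` key are all made up for illustration. It packs one row into a single blob and gives a rough count of the bytes used when every value is its own cell instead. HBase stores the row key, family and qualifier alongside every cell, which is where the per-cell overhead comes from. Timestamps and length fields are ignored here, so the real gap would be larger.

```python
import struct

def encode_row(values):
    """Pack a dense row of floats into one big-endian blob (8 bytes per value)."""
    return struct.pack(f">{len(values)}d", *values)

def decode_row(blob):
    """Inverse of encode_row: unpack the blob back into a list of floats."""
    n = len(blob) // 8
    return list(struct.unpack(f">{n}d", blob))

def cell_layout_bytes(row_key, values, family=b"m"):
    """Rough size when each value is stored as its own cell.

    Every cell repeats the row key, the family and a 4-byte column
    qualifier next to its 8-byte value (hypothetical layout).
    """
    total = 0
    for col in range(len(values)):
        qualifier = struct.pack(">i", col)
        total += len(row_key) + len(family) + len(qualifier) + 8
    return total

row = [0.5, 0.0, -1.25, 3.0]
blob = encode_row(row)
assert decode_row(blob) == row

print("blob bytes:", len(blob))
print("per-cell bytes:", cell_layout_bytes(b"row-0001", row))
```

The flip side is visible in the same sketch: changing one value in the blob layout means decoding and re-encoding the whole row.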

Column storage of values may or may not make things faster, but it is
conceptually nicer because a single-cell update does not have to rewrite the
whole row.  In practice, I am not convinced that you will notice the
difference except for really big rows.

Remember that you don't have to commit to a single choice.  You could use a
rolled-up representation most of the time and then break the rollups into
regions as they get bigger.
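
One possible reading of that hybrid, sketched under assumptions: keep a row as a single blob while it is small, and once it grows past a threshold store it as several blobs, each covering a fixed range of columns. The function names, the `max_chunk` parameter and the idea of keying each chunk by its starting column are hypothetical. Nothing here comes from the HBase or Mahout APIs.

```python
import struct

def encode_chunks(values, max_chunk=1024):
    """Split a dense row into blobs of at most max_chunk values each.

    Returns {start_column: blob}. A row with at most max_chunk values
    yields a single entry, i.e. the plain rolled-up representation.
    """
    chunks = {}
    for start in range(0, len(values), max_chunk):
        part = values[start:start + max_chunk]
        chunks[start] = struct.pack(f">{len(part)}d", *part)
    return chunks

def get_cell(chunks, col, max_chunk=1024):
    """Read one value by touching only the chunk that contains it."""
    start = (col // max_chunk) * max_chunk
    offset = (col - start) * 8
    return struct.unpack_from(">d", chunks[start], offset)[0]

row = [float(i) for i in range(10)]
chunks = encode_chunks(row, max_chunk=4)
print(sorted(chunks))                      # starting column of each chunk
print(get_cell(chunks, 9, max_chunk=4))    # value at column 9
```

With this layout a single-cell update rewrites only one chunk, while a scan over a small row still reads one blob.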


On Tue, May 7, 2013 at 2:32 PM, Gokhan Capan <gkhn...@gmail.com> wrote:

> Nope,
>
> I simply thought that would make accessing and setting individual cells
> more difficult.
>
> Should I? Do you think it would perform better? And I would want to hear if
> you have more design choices in your mind.
>
>
> On Wed, May 8, 2013 at 12:22 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > Have you experimented with, for instance, row number as id, value as
> > binary serialized vector?
> >
> >
> >
> >
> > On Tue, May 7, 2013 at 2:16 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
> >
> > > 2 options:
> > >
> > > 1- row index as the row key, column index as column identifier, and
> > > value as value
> > > 2- row index and column index combined as the row key, and value in a
> > > column called "value"
> > >
> > > Row indices are kept in a member variable in memory, to make iteration
> > > fast.
> > >
> > >
> > >
> > > On Wed, May 8, 2013 at 12:11 AM, Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > >
> > > > How did you store the matrix in HBase?
> > > >
> > > >
> > > > On Tue, May 7, 2013 at 1:08 PM, Gokhan Capan <gkhn...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > For taking large matrices as input and persisting large models
> > > > > (like factor models), I created an HBase-backed version of Mahout
> > > > > matrix.
> > > > >
> > > > > It allows random access to cells and rows as well as assignment,
> > > > > and iteration over rows. viewRow returns a view, and lazy loads
> > > > > actual data if a get is actually invoked.
> > > > >
> > > > > I plan to add a VectorInputFormat on top of it, too.
> > > > >
> > > > > The code that we need to have for our algorithms is tested, but
> > > > > there are still parts of it that are not.
> > > > >
> > > > > I am going to speak about this at HBaseCon, and I wanted to let you
> > > > > know that it can be contributed after some refactoring.
> > > > >
> > > > > Is there any interest?
> > > > >
> > > > > --
> > > > > Gokhan
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Gokhan
> > >
> >
>
>
>
> --
> Gokhan
>
