Re: HBase backed matrices

2013-05-07 Thread Gokhan Capan
So if rows are small, a blob is probably better, and if they get larger I can
split them into blocks of blobs. I will experiment with this.


On Wed, May 8, 2013 at 1:06 AM, Ted Dunning  wrote:

> It really depends on your access patterns.
>
> Blob storage of rows will be much faster for scans and will take much less
> space.
>
> Column storage of values may or may not make things faster, but it is
> conceptually nicer to not have to update so much.  In practice, I am not
> convinced that you will notice the difference except for really big rows.
>
> Remember that you don't have to commit to a single choice.  You could use a
> rolled up representation most of the time and then break the rollups into
> regions as they get bigger.



-- 
Gokhan


Re: HBase backed matrices

2013-05-07 Thread Ted Dunning
It really depends on your access patterns.

Blob storage of rows will be much faster for scans and will take much less
space.
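
A rough back-of-the-envelope estimate of the space difference, as a sketch only. It assumes the classic HBase KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus the row key, family, qualifier and value bytes) and ignores compression, data block encoding and any per-cell tags. The key sizes and the 12-bytes-per-entry blob encoding are illustrative assumptions, not measurements.

```python
# Fixed bytes per cell in the assumed KeyValue layout:
# key length + value length + row length + family length + timestamp + type.
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def cell_size(row_key_len, family_len, qualifier_len, value_len):
    """Approximate uncompressed size in bytes of one stored cell."""
    return FIXED_OVERHEAD + row_key_len + family_len + qualifier_len + value_len


def column_layout_size(nonzeros):
    """One cell per matrix entry: 4-byte row key, 1-byte family,
    4-byte column qualifier, 8-byte double value (all assumed sizes)."""
    return nonzeros * cell_size(4, 1, 4, 8)


def blob_layout_size(nonzeros):
    """One cell per matrix row: the value holds (4-byte index, 8-byte double)
    pairs, stored under a 1-byte qualifier (assumed encoding)."""
    return cell_size(4, 1, 1, 12 * nonzeros)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, column_layout_size(n), blob_layout_size(n))
```

Under these assumptions a row with 100 non-zero entries costs about 3700 bytes as individual cells and about 1226 bytes as a single blob, because the per-cell key is repeated for every value in the column layout.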

Column storage of values may or may not make things faster, but it is
conceptually nicer to not have to update so much.  In practice, I am not
convinced that you will notice the difference except for really big rows.

Remember that you don't have to commit to a single choice.  You could use a
rolled up representation most of the time and then break the rollups into
regions as they get bigger.
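
One way to picture the mixed approach is the sketch below. It keeps a row as a single rolled-up entry while it is small and splits it into fixed-width column blocks once it grows past a threshold. A plain dict stands in for the HBase table, and `MAX_ENTRIES_PER_BLOB`, `BLOCK_WIDTH`, `ROLLED_UP`, `store_row` and `get_cell` are all hypothetical names and values chosen for illustration.

```python
MAX_ENTRIES_PER_BLOB = 4   # hypothetical threshold, tiny so the demo splits
BLOCK_WIDTH = 8            # hypothetical number of columns per block
ROLLED_UP = -1             # marker block id for a row kept as one blob


def store_row(table, row, entries):
    """Store a sparse row {col: value}, rolled up if small, blocked if large."""
    # Drop whatever representation the row had before.
    for key in [k for k in table if k[0] == row]:
        del table[key]
    if len(entries) <= MAX_ENTRIES_PER_BLOB:
        table[(row, ROLLED_UP)] = dict(entries)
        return
    for col, value in entries.items():
        table.setdefault((row, col // BLOCK_WIDTH), {})[col] = value


def get_cell(table, row, col):
    """Read one cell, touching only the rollup or the one block that holds it."""
    rolled = table.get((row, ROLLED_UP))
    if rolled is not None:
        return rolled.get(col, 0.0)
    return table.get((row, col // BLOCK_WIDTH), {}).get(col, 0.0)


if __name__ == "__main__":
    table = {}
    store_row(table, 0, {1: 1.0, 20: 2.0})
    print(sorted(table))            # one rolled-up entry
    store_row(table, 0, {c: float(c) for c in range(0, 40, 5)})
    print(sorted(table))            # several block entries
    print(get_cell(table, 0, 25))
```

In a real table each dict value would be a serialized blob, and the block id would be part of the row key or the column qualifier; that choice is left open here.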


On Tue, May 7, 2013 at 2:32 PM, Gokhan Capan  wrote:

> Nope,
>
> I simply thought that would make accessing and setting individual cells
> more difficult.
>
> Should I? Do you think it would perform better? I would also like to hear
> about any other design choices you have in mind.


Re: HBase backed matrices

2013-05-07 Thread Gokhan Capan
Nope,

I simply thought that would make accessing and setting individual cells
more difficult.

Should I? Do you think it would perform better? I would also like to hear
about any other design choices you have in mind.


On Wed, May 8, 2013 at 12:22 AM, Ted Dunning  wrote:

> Have you experimented with, for instance, row number as id, value as binary
> serialized vector?



-- 
Gokhan


Re: HBase backed matrices

2013-05-07 Thread Ted Dunning
Have you experimented with, for instance, row number as id, value as binary
serialized vector?
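
As a minimal sketch of what "row number as id, value as binary serialized vector" could look like: the byte layout below (cardinality, entry count, then index/value pairs) is an assumption made up for illustration and is not Mahout's own vector serialization. The function names are hypothetical. `set_cell` also shows why single-cell updates are more work with this layout: the whole blob is read, changed and written back.

```python
import struct


def serialize_row(cardinality, entries):
    """Pack a sparse row {col: value} into one byte string.

    Assumed layout: 4-byte cardinality, 4-byte entry count, then one
    (4-byte column index, 8-byte double) pair per entry, all big-endian.
    """
    items = sorted(entries.items())
    header = struct.pack(">ii", cardinality, len(items))
    body = b"".join(struct.pack(">id", col, value) for col, value in items)
    return header + body


def deserialize_row(blob):
    """Inverse of serialize_row; returns (cardinality, {col: value})."""
    cardinality, count = struct.unpack_from(">ii", blob, 0)
    entries = {}
    offset = 8
    for _ in range(count):
        col, value = struct.unpack_from(">id", blob, offset)
        entries[col] = value
        offset += 12
    return cardinality, entries


def set_cell(blob, col, value):
    """Update one cell: a read-modify-write of the entire row blob."""
    cardinality, entries = deserialize_row(blob)
    entries[col] = value
    return serialize_row(cardinality, entries)


if __name__ == "__main__":
    blob = serialize_row(1000, {42: 1.5, 7: -2.0})
    print(len(blob), deserialize_row(set_cell(blob, 9, 3.0)))
```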




On Tue, May 7, 2013 at 2:16 PM, Gokhan Capan  wrote:

> 2 options:
>
> 1- row index as the row key, column index as column identifier, and value
> as value
> 2- row index and column index combined as the row key, and value in a
> column called "value"
>
> Row indices are kept in a member variable in memory, to make iteration
> fast.


Re: HBase backed matrices

2013-05-07 Thread Gokhan Capan
2 options:

1- row index as the row key, column index as column identifier, and value
as value
2- row index and column index combined as the row key, and value in a
column called "value"

Row indices are kept in a member variable in memory, to make iteration fast.
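
The two layouts can be sketched roughly as follows. This is an illustration only: dicts stand in for HBase tables, indices are assumed to be non-negative 32-bit integers, and every function name is hypothetical. Big-endian packing is used because HBase orders row keys byte-wise, so for non-negative indices the byte order then matches the numeric order.

```python
import struct


def row_key(row):
    """Option 1 row key: the row index as 4 big-endian bytes."""
    return struct.pack(">i", row)


def column_qualifier(col):
    """Option 1 column identifier: the column index as 4 big-endian bytes."""
    return struct.pack(">i", col)


def cell_key(row, col):
    """Option 2 row key: row index followed by column index."""
    return struct.pack(">ii", row, col)


def put1(table, row, col, value):
    # Option 1: one table row per matrix row, one column per matrix column.
    table.setdefault(row_key(row), {})[column_qualifier(col)] = struct.pack(">d", value)


def get1(table, row, col):
    raw = table.get(row_key(row), {}).get(column_qualifier(col))
    return 0.0 if raw is None else struct.unpack(">d", raw)[0]


def put2(table, row, col, value):
    # Option 2: one table row per matrix cell, single column named "value".
    table[cell_key(row, col)] = {b"value": struct.pack(">d", value)}


def get2(table, row, col):
    cell = table.get(cell_key(row, col))
    return 0.0 if cell is None else struct.unpack(">d", cell[b"value"])[0]


if __name__ == "__main__":
    t1, t2 = {}, {}
    put1(t1, 3, 7, 2.5)
    put2(t2, 3, 7, 2.5)
    print(get1(t1, 3, 7), get2(t2, 3, 7), get1(t1, 3, 8))
```

With option 2, all cells of one matrix row share a key prefix, so a prefix scan over the sorted keys visits them in column order.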



On Wed, May 8, 2013 at 12:11 AM, Ted Dunning  wrote:

> How did you store the matrix in HBase?



-- 
Gokhan


Re: HBase backed matrices

2013-05-07 Thread Ted Dunning
How did you store the matrix in HBase?


On Tue, May 7, 2013 at 1:08 PM, Gokhan Capan  wrote:

> Hi,
>
> To take large matrices as input and persist large models (such as factor
> models), I created an HBase-backed version of the Mahout matrix.
>
> It allows random access to cells and rows as well as assignment, and
> iteration over rows. viewRow returns a view and lazily loads the actual
> data only when a get is invoked.
>
> I plan to add a VectorInputFormat on top of it, too.
>
> The parts of the code that our algorithms need are tested, but other
> parts are not yet.
>
> I am going to speak about this at HBaseCon, and I wanted to let you know
> that it can be contributed after some refactoring.
>
> Is there any interest?
>
> --
> Gokhan
>


HBase backed matrices

2013-05-07 Thread Gokhan Capan
Hi,

To take large matrices as input and persist large models (such as factor
models), I created an HBase-backed version of the Mahout matrix.

It allows random access to cells and rows as well as assignment, and
iteration over rows. viewRow returns a view and lazily loads the actual
data only when a get is invoked.
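
The lazy-loading behaviour described above can be sketched as follows. This is not the actual implementation: a dict stands in for the HBase table, and `StoreBackedMatrix`, `LazyRowView`, `view_row` and `fetch_count` are hypothetical names used only to show that creating a view costs nothing and that the row is fetched once, on the first get.

```python
class LazyRowView:
    """A row view that defers the fetch until a value is actually read."""

    def __init__(self, fetch_row, row):
        self._fetch_row = fetch_row
        self._row = row
        self._data = None  # nothing loaded yet

    def get(self, col):
        if self._data is None:
            # First read: fetch the whole row once and keep it.
            self._data = self._fetch_row(self._row)
        return self._data.get(col, 0.0)


class StoreBackedMatrix:
    """Matrix facade over a key-value store (a dict in this sketch)."""

    def __init__(self, store):
        self._store = store
        self.fetch_count = 0  # counts round trips to the store

    def _fetch_row(self, row):
        self.fetch_count += 1
        return dict(self._store.get(row, {}))

    def view_row(self, row):
        # No store access happens here.
        return LazyRowView(self._fetch_row, row)


if __name__ == "__main__":
    matrix = StoreBackedMatrix({3: {1: 4.0}})
    view = matrix.view_row(3)
    print(matrix.fetch_count)              # no fetch yet
    print(view.get(1), matrix.fetch_count)
```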

I plan to add a VectorInputFormat on top of it, too.

The parts of the code that our algorithms need are tested, but other
parts are not yet.

I am going to speak about this at HBaseCon, and I wanted to let you know
that it can be contributed after some refactoring.

Is there any interest?

-- 
Gokhan