Re: HBase backed matrices
So if rows are small, a blob is probably better, and if they get larger I can make blocks of blobs. I will experiment with this.

On Wed, May 8, 2013 at 1:06 AM, Ted Dunning wrote:
> It really depends on your access patterns.
>
> Blob storage of rows will be much faster for scans and will take much less space.
>
> Column storage of values may or may not make things faster, but it is conceptually nicer to not have to update so much. In practice, I am not convinced that you will notice the difference except for really big rows.
>
> Remember that you don't have to commit to a single choice. You could use a rolled up representation most of the time and then break the rollups in to regions as they get bigger.

--
Gokhan
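To make the "blocks of blobs" idea concrete, here is a minimal sketch of how a cell could be addressed once a long row is split into fixed-size blocks. Everything in it is an assumption for illustration: the class name BlockAddress, the fixed blockSize parameter, and the 8-byte (row, block) key are not taken from the code discussed in this thread.

```java
import java.nio.ByteBuffer;

/** Hypothetical addressing scheme for "blocks of blobs"; not from the original code. */
final class BlockAddress {
    final int row;     // matrix row index
    final int block;   // which block of the row holds the cell
    final int offset;  // position of the cell inside that block

    private BlockAddress(int row, int block, int offset) {
        this.row = row;
        this.block = block;
        this.offset = offset;
    }

    /** Maps a (row, col) cell to its block and in-block offset. */
    static BlockAddress of(int row, int col, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        return new BlockAddress(row, col / blockSize, col % blockSize);
    }

    /** Big-endian (row, block) pair, usable as a storage key for the block. */
    byte[] key() {
        return ByteBuffer.allocate(2 * Integer.BYTES).putInt(row).putInt(block).array();
    }

    public static void main(String[] args) {
        BlockAddress a = BlockAddress.of(7, 2500, 1024);
        System.out.println("block=" + a.block + " offset=" + a.offset);
    }
}
```

With small rows the whole row fits in block 0, so this degenerates to the plain row-as-blob layout. Only rows longer than blockSize pay for extra keys.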
Re: HBase backed matrices
It really depends on your access patterns.

Blob storage of rows will be much faster for scans and will take much less space.

Column storage of values may or may not make things faster, but it is conceptually nicer to not have to update so much. In practice, I am not convinced that you will notice the difference except for really big rows.

Remember that you don't have to commit to a single choice. You could use a rolled-up representation most of the time and then break the rollups into regions as they get bigger.

On Tue, May 7, 2013 at 2:32 PM, Gokhan Capan wrote:
> Nope,
>
> I simply thought that would make accessing and setting individual cells more difficult.
>
> Should I? Do you think it would perform better? And I would want to hear if you have more design choices in your mind.
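A rough back-of-the-envelope estimate shows why blob storage takes less space. The sketch assumes the classic HBase KeyValue layout, in which every cell repeats its row key, column family, qualifier, and timestamp, plus about 20 bytes of fixed length and type fields. The class name, the 1-byte family and qualifier names, and the dense 8-byte doubles are assumptions for illustration, and the estimate ignores compression, block encoding, and sparsity.

```java
/** Hypothetical size estimate; assumes the classic KeyValue layout without tags. */
final class StorageEstimate {
    private StorageEstimate() {}

    // key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + type (1)
    static final int FIXED_OVERHEAD = 20;

    /** Approximate serialized size of one cell, in bytes. */
    static long cellBytes(int rowKeyLen, int familyLen, int qualifierLen, long valueLen) {
        return FIXED_OVERHEAD + rowKeyLen + familyLen + qualifierLen + valueLen;
    }

    /** One cell per matrix entry: 4-byte row key, 4-byte qualifier, 8-byte double. */
    static long perCellLayout(int cols) {
        return cols * cellBytes(4, 1, 4, 8);
    }

    /** One cell per matrix row: 4-byte row key, 1-byte qualifier, dense doubles as the value. */
    static long blobLayout(int cols) {
        return cellBytes(4, 1, 1, 8L * cols);
    }

    public static void main(String[] args) {
        int cols = 1000;
        System.out.println("per-cell: " + perCellLayout(cols) + " bytes");
        System.out.println("blob:     " + blobLayout(cols) + " bytes");
    }
}
```

Under these assumptions a dense 1000-column row costs 37 bytes per entry in the per-cell layout and just over 8 bytes per entry as a blob. The gap narrows for sparse rows, since a per-cell layout stores only the non-zero entries.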
Re: HBase backed matrices
Nope,

I simply thought that would make accessing and setting individual cells more difficult.

Should I? Do you think it would perform better? I would also like to hear whether you have other design choices in mind.

On Wed, May 8, 2013 at 12:22 AM, Ted Dunning wrote:
> Have you experimented with, for instance, row number as id, value as binary serialized vector?

--
Gokhan
Re: HBase backed matrices
Have you experimented with, for instance, row number as id, value as binary serialized vector?

On Tue, May 7, 2013 at 2:16 PM, Gokhan Capan wrote:
> 2 options:
>
> 1- row index as the row key, column index as column identifier, and value as value
> 2- row index and column index combined as the row key, and value in a column called "value"
>
> Row indices are kept in a member variable in memory, to make iteration fast.
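As an illustration of "row number as id, value as binary serialized vector", the sketch below packs a dense row into a single byte array and reads it back. It uses plain java.nio so that it stays self-contained. In a Mahout setting, VectorWritable would be the more natural serializer. The class name RowBlob and the length-prefixed format are assumptions, not the format used by the code under discussion.

```java
import java.nio.ByteBuffer;

/** Hypothetical dense-row codec: a 4-byte length followed by 8-byte doubles. */
final class RowBlob {
    private RowBlob() {}

    /** Serializes a dense row into one value to store under the row's key. */
    static byte[] toBytes(double[] row) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + row.length * Double.BYTES);
        buf.putInt(row.length);
        for (double v : row) {
            buf.putDouble(v);
        }
        return buf.array();
    }

    /** Restores the row written by toBytes. */
    static double[] fromBytes(byte[] blob) {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        double[] row = new double[buf.getInt()];
        for (int i = 0; i < row.length; i++) {
            row[i] = buf.getDouble();
        }
        return row;
    }

    public static void main(String[] args) {
        double[] row = {1.5, 0.0, -2.0};
        byte[] blob = toBytes(row);
        System.out.println(blob.length + " bytes, first value " + fromBytes(blob)[0]);
    }
}
```

The trade-off raised elsewhere in the thread is visible here: reading a whole row is a single fetch, but setting one cell means decoding, modifying, and rewriting the entire blob.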
Re: HBase backed matrices
2 options:

1- row index as the row key, column index as column identifier, and value as value
2- row index and column index combined as the row key, and value in a column called "value"

Row indices are kept in a member variable in memory, to make iteration fast.

On Wed, May 8, 2013 at 12:11 AM, Ted Dunning wrote:
> How did you store the matrix in HBase?

--
Gokhan
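The two layouts can be sketched as key encodings. The helper below is hypothetical: the class and method names, and the choice of 4-byte big-endian integers, are assumptions rather than the encoding the actual code uses. Big-endian is a reasonable default because HBase orders row keys lexicographically by byte, so non-negative indices then sort in numeric order.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical key encodings for the two layouts; names are illustrative. */
final class MatrixKeys {
    private MatrixKeys() {}

    /** Option 1: the row key is the row index alone (4 bytes, big-endian). */
    static byte[] rowKey(int row) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(row).array();
    }

    /** Option 1: the column index becomes the column qualifier. */
    static byte[] qualifier(int col) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(col).array();
    }

    /** Option 2: row and column indices combined into a single 8-byte row key. */
    static byte[] cellKey(int row, int col) {
        return ByteBuffer.allocate(2 * Integer.BYTES).putInt(row).putInt(col).array();
    }

    /** Recovers {row, col} from an option-2 key. */
    static int[] decodeCellKey(byte[] key) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        return new int[] {buf.getInt(), buf.getInt()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(cellKey(3, 258)));
        System.out.println(Arrays.toString(decodeCellKey(cellKey(3, 258))));
    }
}
```

With option 1 a whole matrix row lives in one HBase row, so fetching it is a single get. With option 2 each cell is its own HBase row, and reading a matrix row becomes a scan over the keys that share the row prefix.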
Re: HBase backed matrices
How did you store the matrix in HBase?

On Tue, May 7, 2013 at 1:08 PM, Gokhan Capan wrote:
> Hi,
>
> For taking large matrices as input and persisting large models (like factor models), I created an HBase-backed version of Mahout matrix.
>
> It allows random access to cells and rows as well as assignment, and iteration over rows. viewRow returns a view, and lazy loads actual data if a get is actually invoked.
>
> [...]
HBase backed matrices
Hi,

For taking large matrices as input and persisting large models (like factor models), I created an HBase-backed version of the Mahout matrix.

It allows random access to cells and rows, as well as assignment and iteration over rows. viewRow returns a view and lazily loads the actual data only when a get is invoked.

I plan to add a VectorInputFormat on top of it, too.

The code that we need for our own algorithms is tested, but other parts of it are not yet.

I am going to speak about this at HBaseCon, and I wanted to let you know that it can be contributed after some refactoring.

Is there any interest?

--
Gokhan
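The lazy behaviour described for viewRow can be sketched roughly as follows. This is not the actual implementation: LazyRowView and its loader parameter are hypothetical, and the loader stands in for whatever HBase read the real code performs.

```java
import java.util.function.IntFunction;

/** Hypothetical lazy row view: nothing is fetched until the first get. */
final class LazyRowView {
    private final int row;
    private final IntFunction<double[]> loader; // stand-in for the HBase read
    private double[] data;                      // null until first access

    LazyRowView(int row, IntFunction<double[]> loader) {
        this.row = row;
        this.loader = loader;
    }

    /** Loads the row on first use, then serves reads from the cached copy. */
    double get(int col) {
        if (data == null) {
            data = loader.apply(row);
        }
        return data[col];
    }

    boolean isLoaded() {
        return data != null;
    }

    public static void main(String[] args) {
        LazyRowView view = new LazyRowView(2, r -> new double[] {r, r + 0.5});
        System.out.println("loaded before get: " + view.isLoaded());
        System.out.println("value: " + view.get(1));
        System.out.println("loaded after get: " + view.isLoaded());
    }
}
```

Creating many views stays cheap in this sketch, because the storage round trip is paid only for rows that are actually read.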