Ted,

> within a column family the number of distinct columns is nearly
> irrelevant and that the only cost is incurred by the columns that are
> actually present. This would make hbase roughly equivalent to a file-based
> representation that stores the labels for all non-zero elements in a row.

Correct.

  - Andy

________________________________
From: Ted Dunning <[email protected]>
To: [email protected]
Cc: [email protected]
Sent: Mon, November 16, 2009 11:17:34 AM
Subject: Re: Have a idea of leveraging hbase for machine learning

Andrew,

This is very good news. The rapid progress of hbase recently is really
exciting (and hard to keep up with).

The performance bounds in the sparse matrices we are liable to see are
usually forced by a very few very long row vectors (sparsity > 10% in some
cases) and by the vast majority of very sparse row vectors (sparsity <<
0.1%). One or the other of these typically causes problems in
representation and algorithms.

With hbase, I would be worried about whether it would be efficient to store
the sparse rows with only a few entries. It sounds like you are saying that
within a column family the number of distinct columns is nearly irrelevant
and that the only cost is incurred by the columns that are actually
present. This would make hbase roughly equivalent to a file-based
representation that stores the labels for all non-zero elements in a row.

Is my understanding of what you said correct?

On Mon, Nov 16, 2009 at 10:45 AM, Andrew Purtell <[email protected]> wrote:

> > Practically speaking, it probably isn't feasible to have an hbase column
> > per matrix column
>
> Just in case that is predicated on old information: Distinguishing between
> columns and column qualifiers, it is architecturally feasible to have a
> single column family with millions of values in a row with distinct
> qualifiers. Someone with more depth in this space than I could say if order
> of millions is sufficient for handling very large sparse matrices, or how
> compelling that might be. In practical terms, HBASE-1537 (
> https://issues.apache.org/jira/browse/HBASE-1537) makes retrieval of large
> rows with scanners possible with current svn trunk or upcoming release
> 0.21.0. (Chunked get of a single row is not currently under consideration.)
> Previously, indeed due to a limitation of implementation, trying to retrieve
> order of a million values stored in a column would have either blown up the
> region server or the client due to the need to pack all the data into a
> single RPC buffer.

--
Ted Dunning, CTO
DeepDyve
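The equivalence discussed above — one HBase column qualifier per non-zero matrix column, so storage cost scales only with the entries actually present — can be sketched in a few lines of plain Python. This is not HBase client code; the labels and helper names here are hypothetical, purely to illustrate the "store only the labels of non-zero elements" representation Ted describes.

```python
# Sketch of the file-based sparse-row representation Ted compares hbase to:
# keep only the labels (column qualifiers) of non-zero elements. Names and
# labels are illustrative, not HBase API calls.

def store_sparse_row(row):
    """Drop zero entries; cost is proportional to non-zeros only."""
    return {label: value for label, value in row.items() if value != 0.0}

def load_value(stored, label):
    """An absent label is implicitly zero, like a missing qualifier."""
    return stored.get(label, 0.0)

dense = {"col_0": 0.0, "col_1": 3.5, "col_2": 0.0, "col_3": -1.0}
stored = store_sparse_row(dense)
print(stored)  # only the two non-zero entries are kept
print(load_value(stored, "col_0"))  # absent label reads back as 0.0
```

A very sparse row (sparsity << 0.1%) stored this way costs almost nothing, while a dense row simply keeps most of its labels — which is the behavior Andrew confirms for qualifiers within a single column family.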
