Ted,

> within a column family the number of distinct columns is nearly
> irrelevant and that the only cost is incurred by the columns that are
> actually present. This would make hbase roughly equivalent to a file-based
> representation that stores the labels for all non-zero elements in a row.

Correct.

  - Andy

________________________________
From: Ted Dunning <[email protected]>
To: [email protected]
Cc: [email protected]
Sent: Mon, November 16, 2009 11:17:34 AM
Subject: Re: Have a idea of leveraging hbase for machine learning

Andrew,

This is very good news. The rapid progress of hbase recently is really
exciting (and hard to keep up with).

The performance bounds in the sparse matrices we are liable to see are
usually forced by a very few very long row vectors (sparsity > 10% in some
cases) and by the vast majority of very sparse row vectors (sparsity <<
0.1%). One or the other of these typically causes problems in
representation and algorithms.

With hbase, I would be worried about whether it would be efficient to store
the sparse rows with only a few entries. It sounds like you are saying that
within a column family the number of distinct columns is nearly irrelevant
and that the only cost is incurred by the columns that are actually
present. This would make hbase roughly equivalent to a file-based
representation that stores the labels for all non-zero elements in a row.

Is my understanding of what you said correct?

On Mon, Nov 16, 2009 at 10:45 AM, Andrew Purtell <[email protected]> wrote:

> > Practically speaking, it probably isn't feasible to have an hbase column
> > per matrix column
>
> Just in case that is predicated on old information: Distinguishing between
> columns and column qualifiers, it is architecturally feasible to have a
> single column family with millions of values in a row with distinct
> qualifiers. Someone with more depth in this space than I could say if order
> of millions is sufficient for handling very large sparse matrices, or how
> compelling that might be. In practical terms, HBASE-1537 (
> https://issues.apache.org/jira/browse/HBASE-1537) makes retrieval of large
> rows with scanners possible with current svn trunk or upcoming release
> 0.21.0. (Chunked get of a single row is not currently under consideration.)
> Previously, indeed due to a limitation of implementation, trying to retrieve
> order of a million values stored in a column would have either blown up the
> region server or the client due to the need to pack all the data into a
> single RPC buffer.

--
Ted Dunning, CTO
DeepDyve
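The equivalence discussed above — one HBase column qualifier per non-zero matrix column, so storage cost scales only with the entries actually present — can be sketched in a few lines of plain Python. This is not HBase client code; the labels and helper names here are hypothetical, purely to illustrate the "store only the labels of non-zero elements" representation Ted describes.

```python
# Sketch of the file-based sparse-row representation Ted compares hbase to:
# keep only the labels (column qualifiers) of non-zero elements. Names and
# labels are illustrative, not HBase API calls.

def store_sparse_row(row):
    """Drop zero entries; cost is proportional to non-zeros only."""
    return {label: value for label, value in row.items() if value != 0.0}

def load_value(stored, label):
    """An absent label is implicitly zero, like a missing qualifier."""
    return stored.get(label, 0.0)

dense = {"col_0": 0.0, "col_1": 3.5, "col_2": 0.0, "col_3": -1.0}
stored = store_sparse_row(dense)
print(stored)  # only the two non-zero entries are kept
print(load_value(stored, "col_0"))  # absent label reads back as 0.0
```

A very sparse row (sparsity << 0.1%) stored this way costs almost nothing, while a dense row simply keeps most of its labels — which is the behavior Andrew confirms for qualifiers within a single column family.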
