On Nov 16, 2009, at 3:54 AM, Jeff Zhang wrote:
> Hi all,
>
> I started learning HBase recently, and I found that it can be used for machine
> learning.
> In machine learning we often need to handle matrices and vectors,
> which are well suited to being stored in HBase.
>
> e.g. in text classification we often need to compute the doc-term matrix.
> With HBase, we can store each document as a row, using the
> document id as the row key and the term frequencies (tf) as columns.
> e.g. suppose we have a document A titled "love" whose content is:
> I love this game.
>
> Then we can store them as one hbase row:
> A: {title:love=>1,
> content:I=>1, content:love=>1, content:this=>1, content:game=>1}
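That row layout can be sketched in plain Python, independent of any HBase client. This is a hypothetical helper (the name `doc_to_row` and the tokenization are my own, not anything from HBase or Mahout); the dict keys mirror HBase's "family:qualifier" column naming:

```python
from collections import Counter

def doc_to_row(title, content):
    """Build an HBase-style row dict: {"family:qualifier": term_frequency}.

    Tokenization here is deliberately naive (lowercase, split on
    whitespace, strip a trailing period) just to illustrate the layout.
    """
    row = {}
    for term, count in Counter(title.lower().split()).items():
        row["title:" + term] = count
    for term, count in Counter(content.lower().strip(".").split()).items():
        row["content:" + term] = count
    return row

row = doc_to_row("love", "I love this game.")
# row maps "title:love", "content:i", "content:love",
# "content:this", "content:game" each to 1
```

Writing such a dict as one row per document (keyed by document id) gives exactly the sparse doc-term matrix described above.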
>
>
> With HBase, it becomes very easy for us to compute the similarity between
> documents.
> Another advantage of HBase over raw text data is that the data is
> semi-structured, which makes it easier to program against than
> the raw data.
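For concreteness, similarity between two such term-frequency rows could be computed as cosine similarity over the sparse column dicts. This is a generic sketch of the standard formula, not code from HBase or Mahout:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {column: tf} rows."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = {"content:love": 1, "content:game": 1}
doc2 = {"content:love": 1, "content:chess": 1}
cosine_similarity(doc1, doc2)  # shares one of two terms -> 0.5
```

In practice one would scan the rows for two documents out of the table and feed their column maps to a function like this.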
>
> This is what I have thought of so far; some of it may not be correct. I hope
> to hear ideas from the experts.
>
If you check out the classification algorithms in Mahout, they have HBase as a
storage option. Feedback on them would be appreciated.
I tend to think we should stay as agnostic of the underlying storage as possible.
-Grant