Hi all,
I have started learning HBase recently, and I found that we can use HBase for machine
learning.
In machine learning we often need to handle matrices and vectors, which fit naturally
into HBase.
For example, in text classification we usually have to compute the doc-term matrix.
With HBase, we can store each document as a row, using the document id as the
row key and the term frequencies (tf) as columns.
For example, suppose we have a document A titled "love" whose content is:
I love this game.
Then we can store it as one HBase row:
A: {title:love=>1,
content:I=>1, content:love=>1, content:this=>1, content:game=>1}
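As a rough sketch of what writing such a row could look like with the HBase Java
client (the table name "docs" and the column families "title" and "content" are
just assumptions for illustration, not something fixed by HBase itself):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreDocExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "docs" is an assumed table name with families "title" and "content".
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table docs = conn.getTable(TableName.valueOf("docs"))) {

            // Row key = document id "A"; each term becomes a column qualifier,
            // and the cell value is that term's frequency in the document.
            Put put = new Put(Bytes.toBytes("A"));
            put.addColumn(Bytes.toBytes("title"), Bytes.toBytes("love"), Bytes.toBytes(1L));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("I"), Bytes.toBytes(1L));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("love"), Bytes.toBytes(1L));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("this"), Bytes.toBytes(1L));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("game"), Bytes.toBytes(1L));
            docs.put(put);
        }
    }
}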
Using HBase, it becomes easy to compute the similarity between documents.
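For instance, one could fetch the "content" family of two rows and compute a cosine
similarity over the term counts. Again, this is only a sketch under the same assumed
"docs" table and "content" family as above:

import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DocSimilarity {
    // Read one document's term-frequency vector (the "content" family of its row).
    static Map<String, Long> termVector(Table docs, String docId) throws Exception {
        Result r = docs.get(new Get(Bytes.toBytes(docId)));
        Map<String, Long> tf = new HashMap<>();
        NavigableMap<byte[], byte[]> cols = r.getFamilyMap(Bytes.toBytes("content"));
        if (cols != null) {
            for (Map.Entry<byte[], byte[]> e : cols.entrySet()) {
                tf.put(Bytes.toString(e.getKey()), Bytes.toLong(e.getValue()));
            }
        }
        return tf;
    }

    // Cosine similarity between two term-frequency vectors.
    static double cosine(Map<String, Long> a, Map<String, Long> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Long> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Long other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (long v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}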
Another advantage of HBase over raw text data is that the data is semi-structured,
so I think programming against HBase will be easier than working with the raw data.
This is what I am thinking of so far; some of it may not be correct, and I hope
to hear ideas from experts.
Thank you.
Jeff Zhang