In this case, your best bet may be to come up with an ID structure for these messages that incorporates (leads with) the timestamp, and then have Lucene use that as the key when retrieving any given message. For example, the ID could consist of:

{timestamp} + {unique id}

(Beware: if you're going to load data with this schema in real time, you'll hot-spot one region server; see http://hbase.apache.org/book.html#timeseries for considerations related to this.) Then you can either scan over all data from one time period, or GET a particular message by this (combined) unique ID. There are also types of UUIDs that work this way. But with that much data, you may want to tune it to get the smallest possible row key; depending on the granularity of your timestamp and how unique the "unique" part really needs to be, you might be able to get it down to under 16 bytes. (Consider that the smallest possible unique representation of 100B items is about 37 bits - that is, log base 2 of 100 billion, rounded up - but because you also want time to be part of the key, you probably can't get anywhere near that small.)
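To make that concrete, here is a minimal sketch of the combined key with the Java client, circa HBase 0.94. The "messages" table name, the 8-byte-timestamp + 8-byte-sequence layout, and the example values are illustrative assumptions, not anything prescribed by HBase or this thread:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MessageKeys {

        // 8-byte epoch-millis timestamp + 8-byte sequence number: a fixed
        // 16-byte row key that sorts chronologically, because
        // Bytes.toBytes(long) is big-endian.
        static byte[] rowKey(long timestampMillis, long uniqueId) {
            return Bytes.add(Bytes.toBytes(timestampMillis),
                             Bytes.toBytes(uniqueId));
        }

        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "messages");

            // GET a single message by its combined key (values made up).
            Result one = table.get(new Get(rowKey(1354752000000L, 42L)));

            // Scan one hour of messages: with the timestamp leading the key,
            // the hour is one contiguous slice of the table (stop row exclusive).
            long hourStart = 1354752000000L;
            Scan hour = new Scan(Bytes.toBytes(hourStart),
                                 Bytes.toBytes(hourStart + 3600 * 1000L));
            ResultScanner scanner = table.getScanner(hour);
            for (Result r : scanner) {
                // process each message in the hour...
            }
            scanner.close();
            table.close();
        }
    }

A leading salt or bucket byte would spread the real-time write load mentioned above across region servers, at the cost of running one scan per bucket when reading a time range.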
If you need to scan over LOTS of data (as opposed to just looking up single messages, or small sequential chunks of messages), consider just writing the data to a file in HDFS and using map/reduce to process it. Scanning all 100B of your records won't be possible in any short time frame (by my estimate it would take about 10 hours), but you could do it with map/reduce using an asynchronous model.
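If you do drive that big scan with map/reduce against the table itself rather than raw HDFS files, a rough sketch looks like the following; TableInputFormat hands each region to its own mapper, and everything named here besides the stock HBase/Hadoop classes is an assumption carried over from the sketch above:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanAllMessages {

        // Called once per row; this one only counts, but it's where you'd
        // aggregate or filter the 100B messages in parallel.
        static class MessageMapper extends TableMapper<NullWritable, NullWritable> {
            @Override
            protected void map(ImmutableBytesWritable key, Result row, Context ctx)
                    throws IOException, InterruptedException {
                ctx.getCounter("messages", "seen").increment(1);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "scan-all-messages");
            job.setJarByClass(ScanAllMessages.class);

            Scan scan = new Scan();
            scan.setCaching(1000);      // fetch rows in large batches per RPC
            scan.setCacheBlocks(false); // don't churn the block cache from MR

            TableMapReduceUtil.initTableMapperJob("messages", scan,
                MessageMapper.class, NullWritable.class, NullWritable.class, job);
            job.setNumReduceTasks(0);
            job.setOutputFormatClass(NullOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mappers run region-parallel, so the roughly 10-hour single-client estimate divides by (more or less) the number of region servers.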
One table is still best for this; read up on what regions are and why they mean you don't need multiple tables for the same data (a sketch of pre-splitting one table this way appears at the bottom of this message): http://hbase.apache.org/book.html#regions.arch

There are no secondary indexes in HBase: http://hbase.apache.org/book.html#secondary.indexes. If you use Lucene for this, it would need its own storage (though there are indeed projects that run Lucene on top of HBase: http://www.infoq.com/articles/LuceneHbase).

Ian

On Dec 5, 2012, at 9:28 PM, tgh wrote:

Thank you for your reply. I want to access the data with the Lucene search engine - that is, retrieve any message by key - and I also want to get one hour of data together, so I am thinking of splitting the data into one table per hour. Or, if I store it in one big table, is that better than storing it in 365 tables, or in 365*24 tables? Which is best for my access pattern? I am also confused about how to build a secondary index in HBase if I use a keyword search engine such as Lucene or something else.

Could you help me?
Thank you

-------------
Tian Guanhua

-----Original Message-----
From: user-return-32247-guanhua.tian=ia.ac...@hbase.apache.org [mailto:user-return-32247-guanhua.tian=ia.ac...@hbase.apache.org] On Behalf Of Ian Varley
Sent: December 6, 2012, 11:01
To: user@hbase.apache.org
Subject: Re: how to store 100billion short text messages with hbase

Tian,

The best way to think about how to structure your data in HBase is to ask the question: "How will I access it?" Perhaps you could reply with the sorts of queries you expect to run over this data - for example, retrieve any single conversation between two people in < 10 ms, or show all conversations that happened in a single hour, regardless of participants. HBase only gives you fast GET/SCAN access along a single "primary" key (the row key), so you must choose it carefully, or else duplicate & denormalize your data for fast access.

Your data size seems reasonable (but not overwhelming) for HBase. 100B messages x 1K bytes per message on average comes out to 100TB. That, plus 3x replication in HDFS, means you need roughly 300TB of space. If you have 13 nodes (taking out 2 for redundant master services), that's a requirement of about 23TB of space per server. That's a lot, even these days. Did I get all that math right?

On your question about multiple tables: a table in HBase is only a namespace for row keys, and a container for a set of regions. If it's a homogeneous data set, there's no advantage to breaking it into multiple tables; that's what regions within the table are for.

Ian

ps - Please don't cross-post to both dev@ and user@.

On Dec 5, 2012, at 8:51 PM, tgh wrote:

Hi,

I am trying to use HBase to store 100 billion short text messages. Each message is less than 1000 characters plus a few other items - that is, fewer than 10 fields per message. The whole data set is a stream covering about one year, and I want to create multiple tables to store it. I have two ideas: one is to store each hour's data in its own table, giving 365*24 tables for the year; the other is to store each day's data in its own table, giving 365 tables.

I have about 15 computer nodes to handle this data, and I want to know which is better: the 365*24-table design, the 365-table design, or some other idea. I am really confused about HBase; it is powerful yet a bit complex for me. Could you give me some advice on the HBase data schema and other matters?

Could you help me?
Thank you

---------------------------------
Tian Guanhua
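Tying back to the regions point above: rather than 8,760 (or 365) tables, you would create one table pre-split into regions along the timestamp-led key. A minimal sketch with the 0.94-era admin API - the table and family names, region count, and dates are all made-up assumptions:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateMessagesTable {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

            HTableDescriptor desc = new HTableDescriptor("messages");
            desc.addFamily(new HColumnDescriptor("m")); // one short family name

            // One table, pre-split into 24 regions by timestamp range over
            // the year, instead of 24 (or 8,760) separate tables.
            long yearStart = 1325376000000L; // 2012-01-01T00:00Z, illustrative
            long yearEnd   = 1356998400000L; // 2013-01-01T00:00Z
            int numRegions = 24;
            byte[][] splits = new byte[numRegions - 1][];
            for (int i = 1; i < numRegions; i++) {
                splits[i - 1] = Bytes.toBytes(
                    yearStart + (yearEnd - yearStart) / numRegions * i);
            }
            admin.createTable(desc, splits);
            admin.close();
        }
    }

HBase keeps splitting regions on its own as they grow past the configured size; the pre-split just avoids the initial single-region bottleneck on a fresh table.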