Hello, ryan-san,

Thank you for your kind and polite reply.

I read the Bigtable paper and found two use cases of the column timestamp at Google.

[use case 1]
==================================================
In our Webtable example, we can set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above enables us to tell Bigtable to keep only the most recent three versions of every page.
==================================================

[use case 2]
==================================================
8.3 Personalized Search

Personalized Search (www.google.com/psearch) is an opt-in service that records user queries and clicks across a variety of Google properties such as web search, images, and news. Users can browse their search histories to revisit their old queries and clicks, and they can ask for personalized search results based on their historical Google usage patterns.

Personalized Search stores each user's data in Bigtable. Each user has a unique userid and is assigned a row named by that userid. All user actions are stored in a table. A separate column family is reserved for each type of action (e.g., there is a column family that stores all web queries). Each data element uses as its Bigtable timestamp the time at which the corresponding user action occurred. Personalized Search generates user profiles using a MapReduce over Bigtable. These user profiles are used to personalize live search results.
==================================================

In use case 1, I don't understand why three versions of each web page need to be saved, so this is not a helpful example for me.

Use case 2 is interesting. It shows that the column timestamp can be used to accumulate the events associated with each subject. However, as you pointed out, this structure can lead to big rows. So this usage pattern is applicable only when the number of accumulated events can be bounded.

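To check my understanding of use case 2, here is a minimal sketch in Python. It is only a toy in-memory model, not the HBase or Bigtable API: the class VersionedTable, its methods, and the row and column names are all hypothetical. It also trims old versions on every write for simplicity, whereas, as I understand from your explanation, the real systems remove excess versions later, during compaction.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of a table whose cells keep timestamped versions.

    Hypothetical illustration only; this is not the HBase client API.
    """

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        versions = self._cells[(row, column)]
        versions[timestamp] = value
        # Emulate garbage collection eagerly: keep only the newest N versions.
        for ts in sorted(versions, reverse=True)[self.max_versions:]:
            del versions[ts]

    def get(self, row, column, max_versions=1):
        # Return up to max_versions (timestamp, value) pairs, newest first.
        versions = self._cells.get((row, column), {})
        return sorted(versions.items(), reverse=True)[:max_versions]


# One row per user; the cell timestamp is the time of the user action.
table = VersionedTable(max_versions=3)
events = [(100, "hbase"), (200, "bigtable"), (300, "hdfs"), (400, "cassandra")]
for ts, query in events:
    table.put("user42", "query:web", ts, query)

history = table.get("user42", "query:web", max_versions=10)
latest = table.get("user42", "query:web")
```

In this model the oldest query (timestamp 100) is dropped once the fourth event arrives, so the row stays bounded only because max_versions is bounded. If every event had to be kept, the row would grow without limit, which is the big-row concern you mentioned.
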
When storing machine logs (e.g. CPU load, disk usage, network bandwidth usage), the time of the event should probably be part of the row key (i.e. one row per event). In this sense, the timestamp feature is not a necessity for Personalized Search either. For example, the data may be structured as follows:

row key "<userid>-<time_of_event>"
column family "action:"
column "action_type" (e.g. click, web search)
column "action_data" (e.g. clicked URL, web search query)

This structure eliminates the concern about big rows.
# I wonder if there is any difference in the simplicity of application code.

> The versioning of HBase is integral to the storage mechanism behind it
> (and also cassandra and all bigtable like systems).

Do you mean that versioning was invented mainly for the implementation of Bigtable/HBase and not for the users' sake? If the maximum number of versions is set to one when creating tables, are there any bad effects due to the Bigtable/HBase implementation (e.g. on performance)? If there is no bad impact, I feel it would be better for the default to be one rather than three, and those who want to use versioning should specify the maximum number of versions when creating tables. That would reduce the memtable size and disk storage space by storing only one version.

Any opinions and information are appreciated.

Regards
Takayuki

----- Original Message -----
From: "Ryan Rawson" <ryano...@gmail.com>
To: <hbase-user@hadoop.apache.org>
Sent: Friday, May 07, 2010 11:42 AM
Subject: Re: How is column timestamp useful?

> Have a look at the bigtable paper, it should help you understand
> somewhat why things are the way they are.
>
> The versioning of HBase is integral to the storage mechanism behind it
> (and also cassandra and all bigtable like systems). HBase stores it's
> data on HDFS which has immutable files. Thus "overwriting" old values
> just does not exist.
> So a versioning mechanism was introduced (all
> part of the original BT paper) to allow you to supersede and delete
> (via adding special delete markers) old values. A process known as
> 'compaction' removes excessive versions and deleted values - this
> compaction is run by default once a day (it is IO intensive).
>
> If you don't care about timestamp, you can just ignore them and use
> HBase like any storage system - with a small caveat: excessive version
> creation can cause issues (think hundreds of megs of versions in one
> row - a region would end up being 1 row and larger than the max size
> for a region and thus un-splittable). So avoid that.
>
> But other relational databases use versioning, for example the MVCC of
> Postgres cause multiple version of a value. Normally this is
> completely hidden and is used primarily to implement TX isolation, it
> also is operationally exposed to the administrator - the vacuum
> command.
>
> Looking at the wiki entry for Temporal database, I can say that HBase
> (and bigtable) are NOT temporal databases by their example. When you
> delete a row, it is removed and thus the data goes away. There is a
> time component, but I encourage people to think of it as versioning
> and backup against application bugs - excessive use of the time
> dimension can cause problems (by making a single row larger than the
> max size of a region).
>
> -ryan
>
>
>
> 2010/5/6 Takayuki Tsunakawa <tsunakawa.ta...@jp.fujitsu.com>:
>> Hello,
>>
>> I'm new to HBase, so excuse me if I make odd questions.
>>
>> I'm evaluating HBase from its documentation, and am attracted by its
>> broad functionality such as transaction support, secondary index, REST
>> API, MapReduce integration, etc. When I recommended HBase to my
>> colleagues for the internal project, I was asked a question about how
>> column timestamp (version) is useful. They said "One of the good
>> things of key-value stores is the simple and flexible data structure.
>> But HBase has more structural elements than RDB, column family and
>> timestamp, and those additional elements HBase a bit more difficult
>> than RDB. I understand the usefulness of column family, however, in
>> what situations is timestamp used? Is it really necessary?"
>>
>> I couldn't answer their question. Then I searched HBase web site,
>> HBase user mailing list archive, other web sites with keyword "HBase
>> timestamp", and Cassandra's web site for help. But I could not find
>> any information about how the column timestamp (versioning) is useful.
>>
>> Could you tell me in what situations the timestamp is absolutely
>> necessary or at least desirable? Some real world examples are much
>> appreciated.
>> From the search results, many people don't seem to use the timestamp
>> feature. However, the default maximum versions for each column is 3.
>> If versioning is rarely utilized, doesn't it mean that the storage
>> space for extra two versions is wasted and the default should be one?
>> Please give me your opinions.
>>
>> Is HBase's timestamp feature intended for the following "temporal
>> database"? If so, how do you structure the Person table in the
>> following page?
>>
>> http://en.wikipedia.org/wiki/Temporal_database
>>
>>
>> Regards