HBase makes life easier by managing the files on HDFS for us. HBase compacts data into large files, which is more efficient for both scanning and random access. HBase also supports running MapReduce against tables instead of files, so data analytics on HBase is a clear improvement with no real drawback. The analytics jobs can continue to run at an every-n-minutes interval, but you no longer need to wait 5 minutes for data to arrive before processing can start.
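To make that concrete, here is a rough sketch of what a periodic MapReduce job over an HBase table could look like. This is my own illustration, not code from Chukwa; the table name "chukwa_records" and the output handling are made-up assumptions, and only the TableMapReduceUtil/TableMapper API is the standard HBase one:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecentRecordsJob {

  // The mapper is fed rows straight from the table, not lines from files.
  static class RowCountMapper extends TableMapper<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1L);

    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(Bytes.toString(row.get())), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "recent-chukwa-records");
    job.setJarByClass(RecentRecordsJob.class);

    // Scan only rows written in the last 5 minutes; there is no need to
    // wait for a 5-minute file to be rolled and closed before starting.
    long now = System.currentTimeMillis();
    Scan scan = new Scan();
    scan.setTimeRange(now - 5L * 60L * 1000L, now);

    TableMapReduceUtil.initTableMapperJob("chukwa_records", scan,
        RowCountMapper.class, Text.class, LongWritable.class, job);

    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the scan's time range bounds the input, the job picks up only the newest rows each time it runs; there is no file-rolling step to wait on.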
Another limitation this eliminates is daily and hourly rolling. Chukwa used to produce files periodically, and those files needed to be rolled up into bigger files; a plain append doesn't work because late-arriving data has to be re-sorted into the sequence file. Hence we ran hourly and daily jobs that did nothing but sort and merge data, which burned CPU cycles for no real benefit.

Data looks like this as a Chukwa Record:

  Time Partition/Primary Key/Actual Timestamp -> [small hashmap]

Data looks like this in HBase:

  Timestamp/Primary Key -> [big hashmap]

The information is identical; the only difference is that scanning for data is a lot faster and we no longer burn CPU cycles sorting and merging. HBase handles the merging and indexing of data much more elegantly. We don't need to split the data into partitions ourselves: we can keep inserting, and the HBase region servers will partition the data for us and provide fast scanning. If the number of records grows beyond trillions, it is still possible to partition by encoding the date into the table name, if the user chooses to do so.
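For Bill's question below about time partitioning: in this scheme the time partition is simply the leading bytes of the row key. Here is a sketch of what a write could look like; the table name, column family, and key format are illustrative assumptions on my part, not Chukwa's actual schema, and the client calls are the plain HBase Put API (details vary a bit across HBase versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutRecordSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "chukwa_records"); // hypothetical table name

    long timestamp = System.currentTimeMillis();
    String primaryKey = "datanode01/dfs_metrics";      // hypothetical key

    // Row key = timestamp + primary key, so rows are stored time-sorted and
    // a time-range scan never needs a separate sort/merge pass.
    byte[] rowKey = Bytes.add(Bytes.toBytes(timestamp), Bytes.toBytes(primaryKey));

    Put put = new Put(rowKey);
    // Each record field becomes one column in a single family: the "big hashmap".
    put.add(Bytes.toBytes("fields"), Bytes.toBytes("bytes_written"), Bytes.toBytes("1048576"));
    put.add(Bytes.toBytes("fields"), Bytes.toBytes("ops"), Bytes.toBytes("42"));
    table.put(put);
    table.close();
  }
}

Since rows are stored sorted by key, a time-range scan touches only the relevant regions, and the region servers handle splitting automatically as the table grows.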
Bill, you are reading my mind. Yes, I am implying that we should deprecate the current hybrid model and build a cleaner solution that works in the collector. It would be easier for newcomers to adopt.

Regards,
Eric


On 11/22/10 1:19 PM, "Bill Graham" <[email protected]> wrote:

We are going to continue to have use cases where we want log data rolled
up into 5 minute, hourly and daily increments in HDFS to run map reduce
jobs on them. How will this model work with the HBase approach? What
process will aggregate the HBase data into time increments like the
current demux and hourly/daily rolling processes do? Basically, what does
the time partitioning look like in the HBase storage scheme?

> My concern is that the demux process is going to become two parallel
> tracks, one works in mapreduce, and another one works in collector. It
> becomes difficult to have clean efficient parsers which works in both

This statement makes me concerned that you're implying the need to
deprecate the current demux model, which is very different than making
one or the other the default in the configs. Is that the case?

On Mon, Nov 22, 2010 at 11:41 AM, Eric Yang <[email protected]> wrote:
> MySQL support has been removed from Chukwa 0.5. My concern is that the
> demux process is going to become two parallel tracks, one works in
> mapreduce, and another one works in the collector. It becomes difficult
> to have clean, efficient parsers which work in both places. From an
> architecture perspective, incremental updates to data are better than
> batch processing for near-real-time monitoring purposes. I want to
> ensure the Chukwa framework can deliver on Chukwa's mission statement,
> hence I stand by HBase as the default. I was playing with the HBase
> 0.20.6 + Pig 0.8 branch last weekend, and I was very impressed by both
> the speed and performance of this combination. I encourage people to
> try it out.
>
> Regards,
> Eric
>
> On 11/22/10 10:50 AM, "Ariel Rabkin" <[email protected]> wrote:
>
> I agree with Bill and Deshpande that we ought to make clear to users
> that they don't need HICC, and therefore don't need either MySQL or
> HBase.
>
> But I think what Eric meant to ask was which of MySQL and HBase ought
> to be the default *for HICC*. My sense is that the HBase support
> isn't quite mature enough, but it's getting there.
>
> I think HBase is ultimately the way to go. I think we might benefit as
> a community by doing a 0.5 release first, while waiting for the
> pig-based aggregation support that's blocking HBase.
>
> --Ari
>
> On Mon, Nov 22, 2010 at 10:47 AM, Deshpande, Deepak
> <[email protected]> wrote:
>> I agree. Making HBase the default would make some Chukwa users' lives
>> difficult. In my setup, I don't need HDFS. I am using Chukwa merely as
>> a log streaming framework. I have plugged in my own writer to write
>> log files to the local file system (instead of HDFS). I evaluated
>> Chukwa against other frameworks, and Chukwa had much better fault
>> tolerance built in than the other frameworks. This made me recommend
>> Chukwa over the others.
>>
>> Making HBase the default option would definitely make my life
>> difficult :).
>>
>> Thanks,
>> Deepak Deshpande
>
> --
> Ari Rabkin [email protected]
> UC Berkeley Computer Science Department
