HBase makes life easier with file management on HDFS.  HBase rolls the data up 
into large file sets, which are more efficient for scanning and random access. 
HBase also supports MapReduce on tables instead of on files, so data analytics 
on HBase is a clear improvement with no drawback.  Analytics jobs can still run 
on an every-n-minutes interval, but you no longer need to wait 5 minutes for 
data to arrive before processing can start.
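
As a rough illustration (not Chukwa's actual schema; the table name "chukwa" 
and the per-row counting are my assumptions), HBase's TableMapReduceUtil wires 
a Scan straight into a MapReduce job:

// Minimal sketch: a map-only MapReduce job that scans an HBase table
// directly instead of reading rolled-up files from HDFS. The table name
// "chukwa" and the per-row counting are hypothetical.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class TableScanJob {

  // TableMapper hands each HBase row to map() already sorted by row key,
  // so no demux/sort pass is needed before analytics can run.
  static class RowCountMapper extends TableMapper<Text, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(row.get()), new LongWritable(values.size()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-table-scan");
    job.setJarByClass(TableScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in larger batches for a full scan
    scan.setCacheBlocks(false);  // don't churn the block cache from a batch job

    TableMapReduceUtil.initTableMapperJob(
        "chukwa", scan, RowCountMapper.class,
        Text.class, LongWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}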

Another limitation this eliminates is the hourly and daily rolling.  Chukwa 
used to produce files periodically, and those files needed to be rolled up 
into bigger files; a plain append doesn't work because late-arriving data must 
be re-sorted into the sequence file.  Hence we ran hourly and daily jobs that 
did nothing but sort and merge data, burning CPU cycles for no real benefit.

Data looks like this in a Chukwa Record:
Time Partition/Primary Key/Actual Timestamp - [small hashmap]

Data looks like this in HBase:
Timestamp/Primary Key - [big hashmap]

The two layouts are essentially identical; the only difference is that 
scanning the data is a lot faster and we no longer burn CPU cycles sorting and 
merging it.  HBase handles the merging and indexing of data much more 
elegantly.
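
To make the layout concrete, here is a minimal sketch of writing one record 
under that scheme.  The table name "chukwa", the "record" family, and the 
field names are assumptions for illustration only:

// Hypothetical sketch of the HBase layout above: row key = timestamp +
// primary key, with the "big hashmap" stored as columns of one family.
// Table/family/field names are assumptions, not Chukwa's real schema.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "chukwa");

    // HBase keeps rows sorted by key, so a late-arriving record simply
    // lands at the right position; no hourly/daily sort-and-merge job.
    long ts = System.currentTimeMillis();
    byte[] rowKey = Bytes.add(Bytes.toBytes(ts), Bytes.toBytes("/SystemMetrics"));

    Put put = new Put(rowKey);
    // Each record field becomes one column in the "record" family.
    put.add(Bytes.toBytes("record"), Bytes.toBytes("cpuUser"), Bytes.toBytes("12.5"));
    put.add(Bytes.toBytes("record"), Bytes.toBytes("cpuSystem"), Bytes.toBytes("3.1"));
    table.put(put);
    table.close();
  }
}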

We don't need to split the data into separate partitions ourselves, because 
HBase handles that for us.  We simply keep inserting data, and the HBase 
region servers partition it and provide fast scanning.  If the number of 
records grows beyond trillions, it is still possible to partition by encoding 
the date in the table name, should the user choose to; see the sketch below.
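
A hedged sketch of that optional table-per-date scheme, assuming a 
"chukwa_yyyyMMdd" naming convention and a "record" family (both my 
assumptions):

// Hedged sketch of the optional partition-by-table-name scheme: one table
// per day, created on demand. The "chukwa_yyyyMMdd" naming and "record"
// column family are assumptions for illustration.
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DailyTables {
  // e.g. "chukwa_20101122" for Nov 22, 2010
  static String tableFor(Date day) {
    return "chukwa_" + new SimpleDateFormat("yyyyMMdd").format(day);
  }

  // Create the day's table if it doesn't exist yet; writers route each
  // record to tableFor(recordDate) instead of one ever-growing table.
  static void ensureTable(HBaseAdmin admin, String name) throws Exception {
    if (!admin.tableExists(name)) {
      HTableDescriptor desc = new HTableDescriptor(name);
      desc.addFamily(new HColumnDescriptor("record"));
      admin.createTable(desc);
    }
  }

  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    ensureTable(admin, tableFor(new Date()));
  }
}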

Bill, you are reading my mind.  I do intend to deprecate the current hybrid 
model and build a cleaner solution that works in the collector.  It would be 
easier for newcomers to adopt.

Regards,
Eric

On 11/22/10 1:19 PM, "Bill Graham" <[email protected]> wrote:

We are going to continue to have use cases where we want log data
rolled up into 5 minute, hourly and daily increments in HDFS to run
MapReduce jobs on them. How will this model work with the HBase
approach? What process will aggregate the HBase data into time
increments like the current demux and hourly/daily rolling processes
do? Basically, what does the time partitioning look like in the HBase
storage scheme?

> My concern is that the demux process is going to become two parallel
> tracks: one runs in MapReduce, and the other runs in the collector.  It
> becomes difficult to have clean, efficient parsers that work in both

This statement makes me concerned that you're implying the need to
deprecate the current demux model, which is very different than making
one or the other the default in the configs. Is that the case?



On Mon, Nov 22, 2010 at 11:41 AM, Eric Yang <[email protected]> wrote:
> MySQL support has been removed from Chukwa 0.5.  My concern is that the demux 
> process is going to become two parallel tracks: one runs in MapReduce, and 
> the other runs in the collector.  It becomes difficult to have clean, 
> efficient parsers that work in both places.  From an architecture perspective, 
> incremental updates to data are better than batch processing for near-real-time 
> monitoring purposes.  I'd like to ensure the Chukwa framework can deliver on 
> Chukwa's mission statement, hence I stand by HBase as the default.  I was 
> playing with the HBase 0.20.6 + Pig 0.8 branch last weekend, and I was very 
> impressed by the speed and performance of the combination.  I encourage people 
> to try it out.
>
> Regards,
> Eric
>
> On 11/22/10 10:50 AM, "Ariel Rabkin" <[email protected]> wrote:
>
> I agree with Bill and Deshpande that we ought to make clear to users
> that they don't need HICC, and therefore don't need either MySQL or
> HBase.
>
> But I think what Eric meant to ask was which of MySQL and HBase ought
> to be the default *for HICC*.  My sense is that the HBase support
> isn't quite mature enough, but it's getting there.
>
> I think HBase is ultimately the way to go. I think we might benefit as
> a community by doing a 0.5 release first, while waiting for the
> pig-based aggregation support that's blocking HBase.
>
> --Ari
>
> On Mon, Nov 22, 2010 at 10:47 AM, Deshpande, Deepak
> <[email protected]> wrote:
>> I agree. Making HBase the default would make some Chukwa users' lives 
>> difficult. In my setup, I don't need HDFS. I am using Chukwa merely as a 
>> log streaming framework. I have plugged in my own writer to write log files 
>> to the local file system (instead of HDFS). I evaluated Chukwa against other 
>> frameworks, and Chukwa had much better fault tolerance built in than the 
>> others. This made me recommend Chukwa over the other frameworks.
>>
>> Making HBase the default option would definitely make my life difficult :).
>>
>> Thanks,
>> Deepak Deshpande
>>
>
>
> --
> Ari Rabkin [email protected]
> UC Berkeley Computer Science Department
>
>
