I see plenty of value in the HBase approach, but I'm still not clear on how the time and data type partitioning would be handled more efficiently within HBase when running a job on a specific 5-minute interval for a given data type. I've only used HBase briefly, so I could certainly be missing something, but I thought range scans sort rows by byte order, which works for string types but not for numbers.
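To make the question concrete, below is roughly the bounded scan I'd hope to be able to write. It's only a sketch against the plain HBase client API as I understand it: the table name "chukwa_records", the "SysLog" data type, and the <data_type>/<zero-padded-epoch-seconds> key layout are all invented for illustration, with fixed-width padding being one way to make byte order agree with numeric order.

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FiveMinuteScan {
      // Hypothetical key layout: <data_type>/<epoch seconds, zero-padded
      // to 10 digits>. Fixed-width padding makes lexicographic (byte)
      // order match numeric order, so a start/stop row pair bounds the scan.
      static byte[] rowKey(String dataType, long epochSeconds) {
        return Bytes.toBytes(String.format("%s/%010d", dataType, epochSeconds));
      }

      public static void main(String[] args) throws IOException {
        // "chukwa_records" is a made-up table name for this sketch.
        HTable table = new HTable(new HBaseConfiguration(), "chukwa_records");
        long start = 1290466800L; // start of the 5-minute window
        // Start row is inclusive, stop row is exclusive: only rows for
        // this data type and time window are touched by the scan.
        Scan scan = new Scan(rowKey("SysLog", start),
                             rowKey("SysLog", start + 300));
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
          }
        } finally {
          scanner.close();
        }
      }
    }

A scan like that only touches the wanted rows because the key puts the data type first and encodes time at a fixed width, which brings me back to the schema.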
So if your row ids are <timestamp>/<data_type>, how do you fetch all the data for a given data_type and time period without potentially scanning many unnecessary rows? The timestamps will sort alphabetically, not numerically, and the data_types will be mixed together. Under the current scheme, since the partitioning is done in HDFS, you can just read <data_type>/<time>/part-* to get exactly the records you're looking for.

On Mon, Nov 22, 2010 at 5:00 PM, Eric Yang <[email protected]> wrote:

> Comparison chart:
>
> -------------------------------------------------------------------------------
> | Chukwa Types          | Chukwa classic           | Chukwa on HBase          |
> -------------------------------------------------------------------------------
> | Installation cost     | Hadoop + Chukwa          | Hadoop + HBase + Chukwa  |
> -------------------------------------------------------------------------------
> | Data latency          | fixed n minutes          | 50-100 ms                |
> -------------------------------------------------------------------------------
> | File management       | Hourly/daily roll-up     | HBase periodically       |
> | cost                  | MapReduce job            | spills data to disk      |
> -------------------------------------------------------------------------------
> | Record size           | Small; needs to fit      | Data node block          |
> |                       | in a Java HashMap        | size (64 MB)             |
> -------------------------------------------------------------------------------
> | GUI-friendly view     | Data needs to be         | Drill down to raw        |
> |                       | aggregated first         | data or aggregated       |
> -------------------------------------------------------------------------------
> | Demux                 | Single reducer, or       | Writes to HBase in       |
> |                       | creates multiple         | parallel                 |
> |                       | part-nnn files,          |                          |
> |                       | unsorted between files   |                          |
> -------------------------------------------------------------------------------
> | Demux output          | Sequence file            | HBase table              |
> -------------------------------------------------------------------------------
> | Data analytics tools  | MapReduce/Pig            | MR/Pig/Hive/Cascading    |
> -------------------------------------------------------------------------------
>
> Regards,
> Eric
>
> On 11/22/10 3:05 PM, "Ahmed Fathalla" <[email protected]> wrote:
>
>> I think what we need to do is create some kind of comparison table
>> contrasting the merits of each approach (HBase vs. normal demux
>> processing). This exercise will be useful both for deciding on the
>> default and for documentation, to illustrate the differences for new
>> users.
>>
>>
>> On Mon, Nov 22, 2010 at 11:19 PM, Bill Graham <[email protected]> wrote:
>>
>>> We are going to continue to have use cases where we want log data
>>> rolled up into 5-minute, hourly, and daily increments in HDFS so we
>>> can run MapReduce jobs on it. How will this model work with the HBase
>>> approach? What process will aggregate the HBase data into time
>>> increments the way the current demux and hourly/daily rolling
>>> processes do? Basically, what does the time partitioning look like in
>>> the HBase storage scheme?
>>>
>>>> My concern is that the demux process is going to become two parallel
>>>> tracks: one works in MapReduce, and the other works in the
>>>> collector. It becomes difficult to have clean, efficient parsers
>>>> that work in both.
>>>
>>> This statement makes me worry that you're implying the current demux
>>> model needs to be deprecated, which is very different from making one
>>> or the other the default in the configs. Is that the case?
>>>
>>>
>>> On Mon, Nov 22, 2010 at 11:41 AM, Eric Yang <[email protected]> wrote:
>>>>
>>>> MySQL support has been removed from Chukwa 0.5. My concern is that
>>>> the demux process is going to become two parallel tracks: one works
>>>> in MapReduce, and the other works in the collector. It becomes
>>>> difficult to have clean, efficient parsers that work in both places.
>>>> From an architecture perspective, incremental updates to the data
>>>> are better than batch processing for near-real-time monitoring. I'd
>>>> like to ensure the Chukwa framework can deliver on Chukwa's mission
>>>> statement, hence I stand by HBase as the default. I was playing with
>>>> HBase 0.20.6 + the Pig 0.8 branch last weekend, and I was very
>>>> impressed by both the speed and the performance of this combination.
>>>> I encourage people to try it out.
>>>>
>>>> Regards,
>>>> Eric
>>>>
>>>> On 11/22/10 10:50 AM, "Ariel Rabkin" <[email protected]> wrote:
>>>>
>>>> I agree with Bill and Deshpande that we ought to make clear to users
>>>> that they don't need HICC, and therefore don't need either MySQL or
>>>> HBase.
>>>>
>>>> But I think what Eric meant to ask was which of MySQL and HBase ought
>>>> to be the default *for HICC*. My sense is that the HBase support
>>>> isn't quite mature enough, but it's getting there.
>>>>
>>>> I think HBase is ultimately the way to go, but we might benefit as
>>>> a community by doing a 0.5 release first, while waiting for the
>>>> Pig-based aggregation support that's blocking HBase.
>>>>
>>>> --Ari
>>>>
>>>> On Mon, Nov 22, 2010 at 10:47 AM, Deshpande, Deepak
>>>> <[email protected]> wrote:
>>>>>
>>>>> I agree. Making HBase the default would make some Chukwa users'
>>>>> lives difficult. In my setup, I don't need HDFS. I am using Chukwa
>>>>> merely as a log streaming framework: I have plugged in my own
>>>>> writer to write log files to the local file system (instead of
>>>>> HDFS). I evaluated Chukwa against other frameworks, and Chukwa had
>>>>> better fault tolerance built in than the others, which made me
>>>>> recommend it over them.
>>>>>
>>>>> Making HBase the default option would definitely make my life
>>>>> difficult :).
>>>>>
>>>>> Thanks,
>>>>> Deepak Deshpande
>>>>
>>>>
>>>> --
>>>> Ari Rabkin [email protected]
>>>> UC Berkeley Computer Science Department
>>>
>>
>>
>> --
>> Ahmed Fathalla
>
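P.S. For contrast, here's the whole "query" under the current HDFS scheme, again with a repository path I've made up for illustration; only the part files for that data type and window are ever opened:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PartFileGlob {
      public static void main(String[] args) throws IOException {
        // Demux has already partitioned by data type and time in HDFS,
        // so a single glob selects exactly the wanted part files.
        // The /chukwa/repos/<data_type>/<date>/<time> layout below is
        // hypothetical, not the actual repository structure.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] parts = fs.globStatus(
            new Path("/chukwa/repos/SysLog/20101122/1700/part-*"));
        if (parts != null) {
          for (FileStatus part : parts) {
            System.out.println(part.getPath());
          }
        }
      }
    }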
