HBase stores only bytes. What goes into each cell is decided by the demux parser. In the parsers I have implemented, Chukwa data is currently stored as byte strings. Users have full control over the data type stored in each HBase column by customizing the demux parser.
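
To make the "bytes only" point concrete, here is a minimal sketch in Python (not Chukwa or HBase code; the function names are made up for illustration). It shows two ways a parser might turn the same metric value into cell bytes: as a text byte string, or as a fixed-width binary number.

```python
import struct


def encode_as_string(value):
    """Encode a value as its UTF-8 text form, e.g. 42 -> b'42'."""
    return str(value).encode("utf-8")


def encode_as_long(value):
    """Encode an integer as an 8-byte big-endian signed long."""
    return struct.pack(">q", value)


def decode_long(cell):
    """Reverse of encode_as_long: read an 8-byte big-endian signed long."""
    return struct.unpack(">q", cell)[0]


# The same metric value, stored two different ways.
string_cell = encode_as_string(42)
long_cell = encode_as_long(42)
```

Either form is just bytes to the store; whoever reads the column has to know which encoding the parser chose.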
regards,
Eric

On Fri, Oct 22, 2010 at 5:21 PM, Corbin Hoenes <[email protected]> wrote:
> Eric, in Chukwa 0.5 is HBase the final store instead of HDFS? What format
> will the HBase data be in (e.g. a ChukwaRecord object? Something user
> configurable?)
>
> Sent from my iPhone
>
> On Oct 22, 2010, at 8:48 AM, Eric Yang <[email protected]> wrote:
>
>> Hi Matt,
>>
>> This is expected in Chukwa archives. When an agent is unable to post to
>> the collector, it will retry posting the same data to another
>> collector, or retry with the same collector when no other collector is
>> available. Under high load, a collector may have written the data without
>> properly acknowledging it back to the agent. The Chukwa philosophy is to
>> retry until an acknowledgement is received. Duplicate data is filtered
>> out after the data has been received.
>>
>> The duplicate filtering in Chukwa 0.3.0 depends on loading the data into
>> MySQL. Rows with the same primary key update the same row, which removes
>> duplicates. It is possible to build a duplicate detection process
>> prior to demux which filters data based on sequence id + data type +
>> source (host), but this hasn't been implemented because the primary key
>> update method works well for my use case.
>>
>> In Chukwa 0.5, we are treating duplication the same way as in Chukwa 0.3:
>> any duplicated row in HBase is replaced based on timestamp +
>> HBase row key.
>>
>> regards,
>> Eric
>>
>> On Thu, Oct 21, 2010 at 8:22 PM, Matt Davies <[email protected]> wrote:
>>>
>>> Hey everyone,
>>>
>>> I have a situation where I'm seeing duplicated data downstream before the
>>> demux process. It appears this happens during high system loads, and we are
>>> still using the 0.3.0 series.
>>>
>>> So, we have validated that there is a single, unique entry in our source
>>> file which then shows up a random number of times before we see it in demux.
>>> So, it appears that there is duplication happening somewhere between the
>>> agent and collector.
>>>
>>> Has anyone else seen this? Any ideas as to why we are seeing this during
>>> high system loads, but not during lower loads?
>>>
>>> TIA,
>>> Matt
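
For reference, the pre-demux duplicate detection described in the quoted thread (filtering on sequence id + data type + source host) could look roughly like the sketch below. This is Python for illustration only and is not something Chukwa implements; the dictionary keys `seq_id`, `data_type` and `source` are hypothetical names, not Chukwa field names.

```python
def dedup_chunks(chunks):
    """Yield each chunk once, keyed on (sequence id, data type, source host).

    Keeps the first occurrence and drops later repeats, such as a chunk
    re-sent by an agent that never received an acknowledgement.
    """
    seen = set()
    for chunk in chunks:
        key = (chunk["seq_id"], chunk["data_type"], chunk["source"])
        if key in seen:
            continue  # same chunk delivered again, skip it
        seen.add(key)
        yield chunk


# Example: the second entry is a retry of the first.
received = [
    {"seq_id": 100, "data_type": "SysLog", "source": "host1", "body": "a"},
    {"seq_id": 100, "data_type": "SysLog", "source": "host1", "body": "a"},
    {"seq_id": 100, "data_type": "SysLog", "source": "host2", "body": "b"},
]
unique = list(dedup_chunks(received))
```

The set of seen keys grows with the input, so a real filter would need to bound it, for example by processing one time window at a time.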
