Corbin, when the HBase counter is removed from the data pipeline, does that mean the agent has not changed, but the collector has?
If yes: it sounds like the problem is the collector writing too slowly to HBase, rather than the agent sending data too slowly to the collector. What does the demux parser look like? Maybe the slowdown is occurring in the extraction process; the ETL process consumes more CPU in the collector when HBaseWriter is used. You may need to make sure the demux parser is optimized.

If no: have more data streams been added to the Chukwa agent? Which adapters are used to collect data?

There are two reasons we don't split the data stream at the agent level:

1. Checkpointing at the agent level is done by stream offset for efficiency. If we split the stream into multiple chunks and checkpointed every chunk, it would mean too many write operations on the source node, which could impact the source system.
2. For log collection, we want to make sure that we are sending data in sequence order, to ensure the log stream is processed in a linear fashion.

regards,
Eric

On Thu, Jan 26, 2012 at 1:27 PM, Corbin Hoenes <[email protected]> wrote:
> Eric,
>
> We use Chukwa for log aggregation of web servers and it powers our analytics
> pipeline. It's been super useful and solid, but we are running into a bit of
> a problem. I was hoping to split my data stream and create a realtime
> pipeline w/ HBase but also still stream into HDFS for batch MR processing.
>
> I am running some simple calculations on pageviews coming in and wanted to
> update HBase using counters. This is slow right now, since I only really have
> one servlet processing my chunks in my demo environment. Without the realtime
> HBase counters in the pipeline, data flows a couple of orders of magnitude
> faster. I was hoping that with smaller chunks and many more collector servlets
> I could make it scale better, but right now it slows down the data stream too
> much.
>
> We use only 3 collectors in production and they handle the traffic well,
> but adding more would give us more concurrent HBase writer capability. I was
> hoping there was a knob to allow for more concurrent chunk writing.
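[Editor's note: Eric's first point, one offset checkpoint per stream rather than one write per chunk, can be sketched as below. This is an illustrative Java sketch only; the class and method names are invented for the example and are not Chukwa's actual checkpoint code.]

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of offset-based checkpointing: the agent tracks a single byte
 * offset per stream, so a restart resumes from the last committed offset.
 * One small checkpoint write covers a whole batch of chunks, instead of
 * one write per chunk on the source node.
 */
public class OffsetCheckpoint {
    private long committedOffset = 0;                              // last durable offset
    private final List<String> checkpointLog = new ArrayList<>(); // stand-in for on-disk writes

    // Called after a batch of chunks is acknowledged by the collector.
    public void commit(long newOffset) {
        committedOffset = newOffset;
        checkpointLog.add("offset=" + newOffset);
    }

    // On restart, the agent seeks the source file to this offset.
    public long resumeFrom() {
        return committedOffset;
    }

    public int checkpointWrites() {
        return checkpointLog.size();
    }

    public static void main(String[] args) {
        OffsetCheckpoint cp = new OffsetCheckpoint();
        // Many chunks may have been shipped, but only three batch acks occurred:
        cp.commit(4096);
        cp.commit(8192);
        cp.commit(16384);
        System.out.println("resume at " + cp.resumeFrom()
                + " after " + cp.checkpointWrites() + " checkpoint writes");
    }
}
```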
>
> On Jan 26, 2012, at 1:03 PM, Eric Yang wrote:
>
>> Hi Corbin,
>>
>> This is by design. We concatenate all data streams into an in-memory
>> queue on the agent and establish only one HTTP connection to the
>> collector. This is for horizontal scalability, so that we can support
>> more machines. At the same time, it also ensures that the agent can
>> write more data per HTTP post, reducing the overhead of HTTP headers
>> and connection handshakes.
>>
>> regards,
>> Eric
>>
>> On Thu, Jan 26, 2012 at 11:51 AM, Corbin Hoenes <[email protected]> wrote:
>>> I am trying to do some real-time processing of the data coming into my
>>> Chukwa pipeline and noticed that, using a single agent, I don't seem to
>>> be getting very many servlets handling the requests. Peeking at the
>>> ChukwaAgent code, it looks like the agents are limited to a single
>>> HttpConnector.
>>>
>>> Is this by design, or am I off base in my analysis of how it works?
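[Editor's note: Eric's design point, that all streams feed one in-memory queue which is drained into a single HTTP post, can be sketched as below. The class and method names are invented for illustration; this is not the actual ChukwaAgent/HttpConnector code.]

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Sketch of agent-side batching: chunks from every adapter land in one
 * in-memory queue, and the connector drains the entire queue into a
 * single HTTP post body. However many chunks are queued, there is one
 * set of HTTP headers and one connection handshake per post.
 */
public class BatchingConnector {
    private final Queue<String> queue = new ArrayDeque<>();

    // Adapters from all data streams enqueue into the same queue.
    public void enqueue(String chunk) {
        queue.add(chunk);
    }

    // Drain everything queued so far into one payload for one HTTP post.
    public String drainToSinglePost() {
        StringBuilder payload = new StringBuilder();
        while (!queue.isEmpty()) {
            payload.append(queue.poll()).append('\n');
        }
        return payload.toString();
    }

    public static void main(String[] args) {
        BatchingConnector c = new BatchingConnector();
        c.enqueue("syslog chunk");
        c.enqueue("webserver chunk");
        c.enqueue("metrics chunk");
        // Three chunks, one post body:
        System.out.println(c.drainToSinglePost());
    }
}
```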
