Eric,

We use Chukwa for log aggregation from our web servers, and it powers our analytics pipeline. It's been super useful and solid, but we are running into a bit of a problem. I was hoping to split my data stream: create a realtime pipeline with HBase while still streaming into HDFS for batch MR processing.
I am running some simple calculations on incoming pageviews and wanted to update HBase using counters. This is slow right now, since I really only have one servlet processing my chunks in my demo environment. Without the realtime HBase counters in the pipeline, data flows a couple of orders of magnitude quicker. I was hoping that with smaller chunks and lots more collector servlets I could make it scale better, but right now it slows down the data stream too much. We use only 3 collectors in production and they handle the traffic well, but adding more would give us more concurrent HBase writer capability. I was hoping there was a knob to allow for more concurrent chunk writing.

On Jan 26, 2012, at 1:03 PM, Eric Yang wrote:

> Hi Corbin,
>
> This is by design. We are concatenating all data streams into in
> memory queue on the agent, and establish only one http connection to
> collector. This is for horizontal scalability that we can support
> more machines. At the same time, it also ensures that agent can write
> more data per HTTP post to reduce overhead of HTTP headers and
> connection handshakes.
>
> regards,
> Eric
>
> On Thu, Jan 26, 2012 at 11:51 AM, Corbin Hoenes <[email protected]> wrote:
>> I am trying to do some real-time processing of the data coming into my
>> chukwa pipeline and notice that using a single agent I don't seem to be
>> getting very many servlets handling the requests. Peeking at the ChukwaAgent
>> code it looks like the agents are limited to a single HttpConnector.
>>
>> Is this by design or am I off-base in my analysis of how it works?
