We face the same problem: the performance of taking events from the channel is a severe bottleneck, and it does not go away even when there are fewer events in the channel. Below is a log line with the metrics of writing to HDFS (5 files, batch size 200000); the take operation accounts for most of the total time. A sketch of this kind of configuration is included at the end of this mail.
17 Dec 2013 18:49:28,056 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.HDFSEventSink.process:489) - HdfsSink-TIME-STAT sink[sink_hdfs_b] writers[5] eventcount[200000] all[44513] take[38197] append[5647] sync[17] getFilenameTime[371]

On Mon, Nov 25, 2013 at 4:46 PM, Jan Van Besien <[email protected]> wrote:
> Hi,
>
> Is anybody still looking into this question?
>
> Should I log it in JIRA so that somebody can look into it later?
>
> thanks,
> Jan
>
> On 11/18/2013 11:28 AM, Jan Van Besien wrote:
> > Hi,
> >
> > Sorry it took me a while to answer this. I compiled a small test case
> > using only off-the-shelf Flume components that shows what is going on.
> >
> > The setup is a single agent with an HTTP source, a null sink and a file
> > channel. I am using the default configuration as much as possible.
> >
> > The test goes as follows:
> >
> > - start the agent without the sink
> > - run a script that sends HTTP requests in multiple threads to the HTTP
> >   source (the script simply calls the URL http://localhost:8080/?key=value
> >   over and over again, where value is a random string of 100 chars).
> > - this script does about 100 requests per second on my machine. I leave
> >   it running for a while, so that the file channel contains about 20000 events.
> > - add the null sink to the configuration (around 11:14:33 in the log).
> > - observe the logging of the null sink. You'll see in the log file that
> >   it takes more than 10 seconds per 1000 events (until about event 5000,
> >   around 11:15:33).
> > - stop the script that generates the HTTP requests (i.e. no more writing
> >   to the file channel).
> > - observe the logging of the null sink: events 5000 until 20000 are all
> >   processed within a few seconds.
> >
> > In the attachment:
> > - flume log
> > - thread dumps while the ingest was running and the null sink was enabled
> > - config (agent1.conf)
> >
> > I also tried with more sinks (4), see agent2.conf. The results are the same.
> >
> > Thanks for looking into this,
> > Jan
> >
> > On 11/14/2013 05:08 PM, Brock Noland wrote:
> >> On Thu, Nov 14, 2013 at 2:50 AM, Jan Van Besien <[email protected]> wrote:
> >>
> >>     On 11/13/2013 03:04 PM, Brock Noland wrote:
> >>     > The file channel uses a WAL which sits on disk. Each time an event is
> >>     > committed an fsync is called to ensure that data is durable. Without
> >>     > this fsync there is no durability guarantee. More details here:
> >>     > https://blogs.apache.org/flume/entry/apache_flume_filechannel
> >>
> >>     Yes indeed. I was just not expecting the performance impact to be that big.
> >>
> >>     > The issue is that when the source is committing one-by-one it's
> >>     > consuming the disk doing an fsync for each event. I would find a way
> >>     > to batch up the requests so they are not written one-by-one, or use
> >>     > multiple disks for the file channel.
> >>
> >>     I am already using multiple disks for the channel (4).
> >>
> >> Can you share your configuration?
> >>
> >>     Batching the requests is indeed what I am doing to prevent the file
> >>     channel from being the bottleneck (using a Flume agent with a memory
> >>     channel in front of the agent with the file channel), but it inherently
> >>     means that I lose end-to-end durability, because events are buffered in
> >>     memory before being flushed to disk.
> >>
> >> I would be curious to know, though, whether doubling the sinks would
> >> give more time to the readers.
> >> Could you take three or four thread dumps of the JVM while it's in this
> >> state and share them?

--
have a good day!
chenshang'an
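For anyone who wants to reproduce the kind of setup discussed above, here is a minimal sketch of such an agent (HTTP source, file channel spread over several disks, HDFS sink taking large batches). The agent name, port, paths and capacity values below are my own illustrative assumptions, not the contents of the attached agent1.conf/agent2.conf:

# Sketch only: names, paths and sizes are assumptions for illustration.
agent1.sources  = src_http
agent1.channels = ch_file
agent1.sinks    = sink_hdfs_b

# HTTP source, roughly as in Jan's test (requests on port 8080).
agent1.sources.src_http.type     = http
agent1.sources.src_http.port     = 8080
agent1.sources.src_http.channels = ch_file

# File channel; spreading dataDirs over several physical disks spreads
# the WAL fsync load, as suggested earlier in the thread.
agent1.channels.ch_file.type                = file
agent1.channels.ch_file.checkpointDir       = /data1/flume/checkpoint
agent1.channels.ch_file.dataDirs            = /data1/flume/data,/data2/flume/data,/data3/flume/data,/data4/flume/data
agent1.channels.ch_file.capacity            = 1000000
# transactionCapacity must be at least as large as the sink batch size.
agent1.channels.ch_file.transactionCapacity = 200000

# HDFS sink taking large batches, so one take/commit transaction covers
# 200000 events instead of committing event by event.
agent1.sinks.sink_hdfs_b.type           = hdfs
agent1.sinks.sink_hdfs_b.channel        = ch_file
agent1.sinks.sink_hdfs_b.hdfs.path      = /flume/events
agent1.sinks.sink_hdfs_b.hdfs.batchSize = 200000

The two knobs this thread keeps coming back to are the number of independent disks behind dataDirs and the batch size used on the commit and take sides; even with both tuned, the take[] numbers in the TIME-STAT line above show that the channel read path can still dominate.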
