Thanks Matt and Brock. One characteristic of my application is that it doesn't receive much data, so it can take a long time to reach the 64 MB roll size. I suspect that is what is causing Flume to consume more virtual memory: when I changed the setting to roll every 10 minutes, the VM usage came down. But then smaller files are copied to HDFS, which won't work well with Hadoop.
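For reference, here is roughly the compromise I am thinking of trying (an untested sketch; the 600-second backstop and the 300-second idleTimeout are assumed values, not something suggested in this thread):

# Sketch: keep size-based rolling at ~64 MB, but add a time backstop
# so a slow stream still gets its file closed periodically.
a1.sinks.k1.hdfs.rollSize = 67108864
# Assumed backstop: force a roll after 10 minutes even if 64 MB is not reached.
a1.sinks.k1.hdfs.rollInterval = 600
# Disable count-based rolling so only size/time decide.
a1.sinks.k1.hdfs.rollCount = 0
# Assumed value: close files that receive no events for 5 minutes.
a1.sinks.k1.hdfs.idleTimeout = 300
a1.sinks.k1.hdfs.batchSize = 1000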
Matt - I tried a lower thread count for the Avro source, and it brought the VM usage down a little, but not by much.

Brock - I get "java.lang.OutOfMemoryError: unable to create new native thread" once the Flume VM usage goes above 16 GB, and it doesn't leave room for other applications to run.

The following settings use 16.5 GB of virtual memory:

a1.sinks.k1.hdfs.txnEventMax = 40000
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 67108864
a1.sinks.k1.hdfs.rollCount = 1000
a1.sinks.k1.hdfs.batchSize = 1000

The following settings use 11.5 GB of virtual memory:

#a1.sinks.k1.hdfs.txnEventMax = 40000
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k2.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 500
a1.sinks.k1.hdfs.batchSize = 500
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.maxOpenFiles = 1000

Thanks,
Shibi

From: [email protected]
Date: Sat, 14 Dec 2013 10:57:09 -0600
Subject: Re: Flume uses high Virtual memory
To: [email protected]

Additionally, I'd note that worrying about virtual memory on 64-bit machines is probably not worth your time. Newer versions of malloc() do arena allocation and reserve virtual memory for each thread. This does not, however, actually consume memory.

On Sat, Dec 14, 2013 at 10:49 AM, Matt Wise <[email protected]> wrote:

We ran into an issue just like this when we did not limit our source thread counts. The Avro source can spawn potentially thousands of threads if you don't limit it:

a1.sources.r1.threads = 50

(You can validate this with 'htop'.)

Matt Wise
Sr. Systems Architect
Nextdoor.com

On Fri, Dec 13, 2013 at 2:58 PM, shibi S <[email protected]> wrote:

The Flume agent that is writing to HDFS is high on virtual memory usage (15.6 GB). The agent writes to 3 different directories in HDFS based on the type of data it receives. The configuration is given below. Any idea why the VM usage is high? I see high VM usage only on the agents that are writing to HDFS; other agents are low in VM usage.

Flume version: apache-flume-1.4.0 (I tested with the 1.5 version as well).

PID    USER    PR  NI  VIRT   RES   SHR  S  %CPU  %MEM  TIME+      COMMAND
38663  deploy  20  0   15.6g  576m  15m  S  2.6   0.2   225:19.29  java

Configuration:

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = header1
a1.sources.r1.selector.mapping.red_cancel = c1

Source configuration:

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 60000

Sink configuration:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://<HDFS PATH>/%Y/%m/%d/%H
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.filePrefix = filetype1-
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#a1.sinks.k1.hdfs.txnEventMax = 40000
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k2.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 500
a1.sinks.k1.hdfs.batchSize = 500
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.maxOpenFiles = 1000

Channel configuration:

a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /x/home/deploy/flume/checkpoint2
a1.channels.c2.dataDirs = /x/home/deploy/flume/data2

--
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
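For anyone following along, a consolidated sketch of the source side with Matt's thread cap applied to the configuration above might look like this (50 is Matt's example value, not a tuned number):

# Avro source from the original config, plus the worker-thread limit
# Matt suggested to keep the source from spawning thousands of threads.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 60000
a1.sources.r1.threads = 50
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = header1
a1.sources.r1.selector.mapping.red_cancel = c1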
