The basic design of Flume is that sinks and sources are logically single-threaded. If users want concurrent instances, they should configure multiple sources/sinks. This applies to both Hdfs and Hive Sink. See below for details ...
Hdfs Sink:
- The callTimeoutPool is only used to put a timeout on blocking operations.
- The sink's process() method calls bucketWriter.append(), which then puts the IO operation on the callTimeoutPool to avoid blocking indefinitely.
- BucketWriter.append() returns only when the blocking operation is complete or the timeout kicks in.
- Although multiple bucket writers can exist (one for each open file), only one bucketWriter.append() is active at any time.
- So it is unclear why it needs more than 1 thread in that timeout pool.

Hive Sink creates the same thread pool with 1 thread.
- https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hive-sink/src/main/java/org/apache/flume/sink/hive/HiveSink.java#L493

In terms of concurrency, the Hdfs and Hive sinks are pretty similar in their behavior.

For your case, just adding multiple Hive sinks with identical settings should suffice, even if you are writing to the same table/partition. Your config doesn't need to get fancy with sink processors etc.

If you are shooting to get more perf, you can find some perf numbers around Hive sink here:
https://cwiki.apache.org/confluence/display/FLUME/Performance+Measurements+-+round+2

-Roshan

On 1/25/17, 11:17 PM, "Michal Klempa" <[email protected]> wrote:

>Hi,
>I was working a lot with HiveSink to put the data into Hive, not only
>I discovered this bug
>https://issues.apache.org/jira/browse/HIVE-15658
>but also I have found that HiveSink differs from HDFSEventSink in the
>way the thread pool for
>delayed operations is created.
>
>See this line in HDFSEventSink:
>https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L522
>it uses argument threadsPoolSize which is by default 10
>(https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L97)
>but can be configured as hdfs.threadPoolSize in flume config
>(https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L210)
>
>To the contrary, HiveSink creates the thread pool this way:
>https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hive-sink/src/main/java/org/apache/flume/sink/hive/HiveSink.java#L493
>1 thread with note // call timeout pool needs only 1 thd as sink is
>effectively single threaded
>
>Why is the Hive sink effectively single threaded? There is no notion
>of this in documentation (FlumeUserGuide) and how should I handle this
>situation? For performance reasons, i would like to have multithreaded
>writeout into Hive, do I have to Multiplex/Round-robin fan-out and
>configure multiple HiveSinks? Probably I have to, but it is ugly.
>
>What is the problem that the HiveSInk is single threaded?
>
>Thanks, Michal
>
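For illustration, here is a rough sketch of the call-timeout pattern described in the Hdfs Sink notes above. It is not the actual HDFSEventSink or HiveSink code: the class name CallTimeoutSketch, the method names, and the 200 ms timeout are made up for the example. It only shows why a single worker thread is enough when the caller waits for each blocking call to finish or time out before issuing the next one.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names are illustrative, not taken from the Flume source.
class CallTimeoutSketch {
    // One worker suffices: the caller blocks on every call, so at most
    // one blocking operation is ever in flight.
    private final ExecutorService callTimeoutPool = Executors.newFixedThreadPool(1);
    private final long callTimeoutMillis;

    CallTimeoutSketch(long callTimeoutMillis) {
        this.callTimeoutMillis = callTimeoutMillis;
    }

    // Runs the blocking call on the pool and waits for it, up to the timeout.
    <T> T callWithTimeout(Callable<T> blockingCall) throws Exception {
        Future<T> future = callTimeoutPool.submit(blockingCall);
        try {
            return future.get(callTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck call so the worker is freed
            throw e;
        } catch (ExecutionException e) {
            // Surface the original failure of the blocking call where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }

    void shutdown() {
        callTimeoutPool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        CallTimeoutSketch sink = new CallTimeoutSketch(200);
        try {
            // A fast "append" completes within the timeout.
            String result = sink.callWithTimeout(() -> "appended");
            System.out.println(result);
            try {
                // A slow "append" is abandoned once the timeout expires.
                sink.callWithTimeout(() -> {
                    Thread.sleep(5_000);
                    return "too late";
                });
            } catch (TimeoutException e) {
                System.out.println("timed out");
            }
        } finally {
            sink.shutdown();
        }
    }
}
```

Since callWithTimeout() does not return until the call completes or times out, a second pool thread would never have work to pick up. Extra throughput has to come from running more sink instances.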
