Hi,

I have been working a lot with HiveSink to load data into Hive. Besides running into this bug, https://issues.apache.org/jira/browse/HIVE-15658, I also found that HiveSink differs from HDFSEventSink in the way the thread pool for delayed operations is created.

See this line in HDFSEventSink:
https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L522

It uses the argument threadsPoolSize, which defaults to 10
(https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L97)
but can be configured as hdfs.threadsPoolSize in the Flume config
(https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L210).

In contrast, HiveSink creates the thread pool this way:
https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hive-sink/src/main/java/org/apache/flume/sink/hive/HiveSink.java#L493

That is, with 1 thread and this note:

    // call timeout pool needs only 1 thd as sink is effectively single threaded

Why is the Hive sink effectively single-threaded? The documentation (FlumeUserGuide) does not mention this, so how should I handle the situation? For performance reasons, I would like multithreaded writes into Hive. Do I have to fan out (multiplexing or round-robin) and configure multiple HiveSinks? I probably do, but it is ugly. What prevents HiveSink from being multithreaded?

Thanks,
Michal
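
P.S. To make the difference concrete, here is a minimal sketch of the two pool setups as I understand them. It is not the actual Flume source: the class, method, and constant names are made up for illustration, and it uses only the plain JDK Executors API (no custom thread factory or thread naming).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Illustration only: hypothetical names, not code copied from Flume.
class CallTimeoutPools {
    // Mirrors the HDFS sink default of 10 threads mentioned above.
    static final int DEFAULT_THREADS_POOL_SIZE = 10;

    // HDFS-sink style: the pool size is a parameter that comes from configuration.
    static ExecutorService hdfsStylePool(int threadsPoolSize) {
        return Executors.newFixedThreadPool(threadsPoolSize);
    }

    // Hive-sink style: the pool size is fixed at 1, with no configuration knob.
    static ExecutorService hiveStylePool() {
        return Executors.newFixedThreadPool(1);
    }

    // Helper: the maximum number of worker threads a fixed pool may use.
    // Executors.newFixedThreadPool returns a ThreadPoolExecutor, so the cast is safe here.
    static int maxThreads(ExecutorService pool) {
        return ((ThreadPoolExecutor) pool).getMaximumPoolSize();
    }

    public static void main(String[] args) {
        ExecutorService hdfs = hdfsStylePool(DEFAULT_THREADS_POOL_SIZE);
        ExecutorService hive = hiveStylePool();
        System.out.println("hdfs-style pool threads: " + maxThreads(hdfs));
        System.out.println("hive-style pool threads: " + maxThreads(hive));
        hdfs.shutdown();
        hive.shutdown();
    }
}
```

With the second setup, any calls submitted to the pool run one at a time, which is why I am asking whether several HiveSink instances are the only way to get parallel writes.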
