Re: HiveSink not multithreaded?

Roshan Naik Thu, 26 Jan 2017 12:19:32 -0800

  

The basic design of Flume is that sinks and sources are logically
single-threaded. If users want concurrent instances, they should configure
multiple sources/sinks. This applies to both Hdfs and Hive Sink. See below
for details ...



Hdfs Sink: 
  - The callTimeoutPool is only used to put a timeout on blocking
operation.
  - The sink's process() method calls bucketWriter.append() when then puts
the IO operation on the callTimeoutPool to avoid blocking indefinitely.
  - BucketWriter.append returns only when the blocking operation is
complete or timeout kicks in
  - Although there can be multiple bucket writers existing (one for each
open file), only one bucketWriter.append() is active at anytime.
  - So unclear why it needs more than 1 thread in that timeout pool.
 


Hive Sink creates the same thread pool with 1 thread.
  - 
https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hive-sink/s
rc/main/java/org/apache/flume/sink/hive/HiveSink.java#L493



In terms of concurrency, Hdfs and Hive sinks are pretty similar in their
behavior.

For your case, just adding multiple Hive sinks with the identical settings
should suffice... if you are writing to same table/partition.
Your config doesn¹t need to get fancy with sink processors etc.

If you are shooting to get more perf, you can find some perf numbers
around Hive sink here:

https://cwiki.apache.org/confluence/display/FLUME/Performance+Measurements+
-+round+2


-Roshan



On 1/25/17, 11:17 PM, "Michal Klempa" <[email protected]> wrote:

>Hi,
>I was working a lot with HiveSink to put the data into Hive, not only
>I discovered this bug
>https://issues.apache.org/jira/browse/HIVE-15658
>but also I have found that HiveSink differs from HDFSEventSink in the
>way the thread pool for
>delayed operations is created.
>
>See this line in HDFSEventSink:
>https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/
>src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L522
>it uses argument threadsPoolSize which is by default 10
>(https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink
>/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L97)
>but can be configured as hdfs.threadPoolSize in flume config
>(https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink
>/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L210)
>
>To the contrary, HiveSink creates the thread pool this way:
>https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hive-sink/
>src/main/java/org/apache/flume/sink/hive/HiveSink.java#L493
>1 thread with note // call timeout pool needs only 1 thd as sink is
>effectively single threaded
>
>Why is the Hive sink effectively single threaded? There is no notion
>of this in documentation (FlumeUserGuide) and how should I handle this
>situation? For performance reasons, i would like to have multithreaded
>writeout into Hive, do I have to Multiplex/Round-robin fan-out and
>configure multiple HiveSinks? Probably I have to, but it is ugly.
>
>What is the problem that the HiveSInk is single threaded?
>
>Thanks, Michal
>

Re: HiveSink not multithreaded?

Reply via email to