Hi Boris,

There is a PutKudu NiFi processor, and some work is being done to ensure
it's complete.

https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-kudu-bundle/nifi-kudu-processors/src/main/java/org/apache/nifi/processors/kudu

Are you using that? If not, maybe you could, or at least use it for
inspiration.

Thanks,
Grant

On Thu, Feb 14, 2019 at 4:05 PM Jean-Daniel Cryans <jdcry...@apache.org>
wrote:

> One order of magnitude improvement is great news!
>
> There's something to learn here... Maybe the masters should be screaming
> if they're being slammed with tablet location requests (which I'm guessing
> was your case). Or should we have some recipes like "here's how you should
> write to Kudu from NiFi"? Any thoughts?
>
> In any case, thanks for reporting back!
>
> J-D
>
> On Thu, Feb 14, 2019 at 1:56 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> Hi J-D, we just solved our performance issue! Client reuse did the
>> trick. We knew it was something we had to do, but without going into
>> details, it was not an easy thing to do because we use NiFi to
>> coordinate stuff and could not find a way to reuse a Kudu client
>> safely. Data comes in mini batches (200 operations on average), and for
>> every mini batch we would open a new client and a new session, then
>> close them. We also run them in the fastest mode (AUTO_FLUSH_BACKGROUND),
>> but if we have any errors, we do another pass in AUTO_FLUSH_SYNC mode
>> because of KUDU-2625, since we need to implement some custom error
>> handling logic (some errors in our case can be safely ignored while
>> others are not).
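>>
>> Roughly, that two-pass pattern looks like the sketch below (simplified,
>> not our exact code; ChangeRecord, toOperation() and isIgnorable() are
>> placeholder names for our record-to-operation mapping and
>> error-handling rules):
>>
>>   import org.apache.kudu.client.*;
>>
>>   // client, table and miniBatch come from the surrounding flow.
>>   // First pass: apply the whole mini batch with background flushing.
>>   KuduSession session = client.newSession();
>>   session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
>>   for (ChangeRecord rec : miniBatch) {
>>     session.apply(toOperation(table, rec));
>>   }
>>   session.flush();
>>   boolean hadErrors = session.countPendingErrors() > 0;
>>   session.close();
>>
>>   // Second pass (KUDU-2625): replay synchronously so each row reports
>>   // its own error and we can decide which ones are safe to ignore.
>>   if (hadErrors) {
>>     KuduSession retry = client.newSession();
>>     retry.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC);
>>     for (ChangeRecord rec : miniBatch) {
>>       OperationResponse resp = retry.apply(toOperation(table, rec));
>>       if (resp.hasRowError() && !isIgnorable(resp.getRowError())) {
>>         throw new RuntimeException(resp.getRowError().toString());
>>       }
>>     }
>>     retry.close();
>>   }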
>>
>> Long story short, we found a way to reuse a client instance in NiFi
>> while still keeping NiFi's native concurrency benefits, and our
>> performance improved by at least 10 times!
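>>
>> The reuse itself is conceptually simple, since KuduClient is thread-safe
>> and meant to be shared; the hard part was the NiFi wiring. As a rough
>> illustration only (not our actual NiFi setup), a process-wide, lazily
>> created client could look like this:
>>
>>   import org.apache.kudu.client.KuduClient;
>>
>>   // One KuduClient per JVM, shared by all worker threads; each thread
>>   // still opens its own short-lived KuduSession for its mini batch.
>>   public final class SharedKuduClient {
>>     private static volatile KuduClient client;
>>
>>     public static KuduClient get(String masterAddresses) {
>>       if (client == null) {
>>         synchronized (SharedKuduClient.class) {
>>           if (client == null) {
>>             client = new KuduClient.KuduClientBuilder(masterAddresses).build();
>>           }
>>         }
>>       }
>>       return client;
>>     }
>>   }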
>>
>> Thanks for your help and ideas! It is quite a relief for us!
>> Boris
>>
>> On Thu, Feb 14, 2019 at 11:51 AM Jean-Daniel Cryans <jdcry...@apache.org>
>> wrote:
>>
>>> Hi Boris,
>>>
>>> Thank you for all those details. Some questions:
>>>
>>> - Is test 3 right for the tserver threads/queue? Or did you really just
>>> bump the master's threads/queue and it magically made it ~2.3 times
>>> faster? And then in test 5 you also bumped the tserver and it barely made
>>> it better? Unless I'm missing something big, this is very unexpected. Are
>>> you re-creating the Kudu client each time you send operations? Otherwise,
>>> the tablet locations should stay in the cache and the masters would be
>>> completely out of the picture after the first few seconds.
>>>
>>> - Are you batching the writes? What kind of flush mode are you using?
>>>
>>> - Are there some server-side graphs you can share? I'd like to see
>>> inserts/second across the servers for the duration of the tests.
>>>
>>> - Can you share your table schema and partitions schema? For the columns
>>> I'm mostly interested in the row keys and the cardinality of each column.
>>>
>>> Thanks,
>>>
>>> J-D
>>>
>>> On Thu, Feb 14, 2019 at 5:41 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Hi J-D and thanks for your comments.
>>>>
>>>> On a very high level, we subscribe to 750 Kafka topics and wrote a
>>>> custom app using the Java API to insert into, update, or delete from
>>>> 250 Kudu tables (messages from 3 topics are merged into a single Kudu
>>>> table). Our custom code can spawn any number of threads, and we
>>>> experimented with 10, 20 and 50.
>>>>
>>>> When we did our first test on the development 6-node cluster
>>>> (high-density 2x22 core beast) with 5 tables/15 topics and 10
>>>> concurrent threads, all was good and promising. Once we added all
>>>> tables/topics, our process became very slow and throughput dropped by
>>>> 20-30 times. We increased the number of threads in our custom code to
>>>> 50, and this is when we noticed that Kudu uses only 10 threads while
>>>> the other threads wait in the queue.
>>>>
>>>> Our target Kudu tables were empty when we started, and the cluster was
>>>> pretty much idle.
>>>>
>>>> Here are the results of our benchmarks that you might find interesting.
>>>>
>>>> “threads,queue” in the header means rpc_num_service_threads and
>>>> rpc_service_queue_length.
>>>>
>>>> We started with the defaults, and tests 5 and 6 turned out to be the
>>>> best. Quite a dramatic difference from test 1.
>>>>
>>>> I should also mention that we are running this on our DEV 6-node
>>>> cluster, but it is pretty beefy (2x24 CPU cores, 256GB of RAM, 12
>>>> disks, etc.), and the cluster was not doing anything else but writing
>>>> into Kudu.
>>>>
>>>> It is also interesting that test 7 did not give any further
>>>> improvements. Our speculation is that we just hit the limits of our
>>>> 6-node Kudu cluster, since it can only handle so many tablets at once
>>>> and we use a replication factor of 3.
>>>>
>>>> In another test, we created a simple app that runs selects on these
>>>> tables while the first app keeps writing into them. Throughput dropped
>>>> quite a bit as well with the defaults, but bumping the RPC threads
>>>> helped.
>>>>
>>>> If you have any other thoughts/observations, I would like to hear them!
>>>>
>>>> I think things like that should be somewhere in the Kudu docs, along
>>>> with a few important parameters that organizations new to Kudu must
>>>> tweak. I can write a blog post about it, but I am no Kudu dev, so I do
>>>> not want to misrepresent anything.
>>>>
>>>> For example, we learned the hard way to tweak these two parameters
>>>> right away, as insert performance was terrible out of the box:
>>>>
>>>> [Screenshot of a message from Will Berkeley:] "@wangxg you shouldn't
>>>> need to tune too many parameters despite the large number of available
>>>> ones. Tune --memory_limit_hard_bytes to control the total amount of
>>>> memory Kudu will use. Tune --maintenance_manager_num_threads to about
>>>> 1/3 of the number of disks you are using for Kudu (assuming you are
>>>> using the latest version, 1.5)."
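>>>>
>>>> On a node like ours with 12 data disks, that advice works out to
>>>> something like the following tserver flags (values are only
>>>> illustrative; the hard memory limit depends on how much of the 256GB
>>>> you want to give Kudu):
>>>>
>>>>   # illustrative: give Kudu roughly half of the 256GB box
>>>>   --memory_limit_hard_bytes=137438953472
>>>>   # roughly 1/3 of the 12 data disks
>>>>   --maintenance_manager_num_threads=4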
>>>>
>>>>
>>>>          app      master          tablet          total        total     avg ops    avg duration     avg ops per
>>>>          threads  threads,queue   threads,queue   operations   duration  per sec    per flowfile     flowfile
>>>> test 1   10       10,50           10,50            5,906,849   60 min     98,447     404.389279 ms    66.8861423
>>>> test 2   30       10,50           10,50            3,107,938   32 min     97,123    1611.71499 ms    102.416727
>>>> test 3   10       30,100          10,50           13,617,274   60 min    226,954     448.954472 ms   196.878148
>>>> test 4   30       30,100          10,50            5,794,268   60 min     96,571    2342.55094 ms    114.822107
>>>> test 5   10       30,100          30,100          16,813,710   60 min    280,228     391.113644 ms   183.887024
>>>> test 6   30       30,100          30,100          15,903,303   60 min    265,055    2300.38629 ms    341.316543
>>>> test 7   30       50,200          50,200          12,549,114   60 min    209,151    2364.45707 ms    276.851262
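>>>>
>>>> For reference, the test 5 settings expressed as gflags (in our tests we
>>>> applied them to both the masters and the tablet servers; how you pass
>>>> them depends on your deployment, e.g. a flagfile):
>>>>
>>>>   --rpc_num_service_threads=30
>>>>   --rpc_service_queue_length=100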
>>>>
>>>> On Wed, Feb 13, 2019 at 11:39 AM Jean-Daniel Cryans <
>>>> jdcry...@apache.org> wrote:
>>>>
>>>>> Some comments on the original problem: "we need to process 1000s of
>>>>> operations per second and noticed that our Kudu 1.5 cluster was only using
>>>>> 10 threads while our application spins up 50 clients/threads"
>>>>>
>>>>> I wouldn't directly infer that 20 threads won't be enough to match
>>>>> your needs. The time it takes to service a request can vary greatly: a
>>>>> single thread could process 500 operations that take 2ms each, or 2
>>>>> that take 500ms each, and you have 20 of those threads. The queue is
>>>>> there to make sure that the threads are kept busy instead of bouncing
>>>>> the clients back the moment all the threads are occupied. Your 50
>>>>> threads can't constantly pound all the tservers; there's time spent on
>>>>> the network and whatever processing needs to happen client-side before
>>>>> they go back to Kudu.
>>>>>
>>>>> TBH there's not a whole lot of science around how we set those two
>>>>> defaults (# of threads and queue size), but it's very
>>>>> workload-dependent. Ideally the tservers would just right-size the
>>>>> pools based on the kind of requests that are coming in and the amount
>>>>> of memory they can use. I guess CPU also comes into the picture, but
>>>>> again it depends on the workload: Kudu stores data, so it tends to be
>>>>> more IO-bound than CPU-bound.
>>>>>
>>>>> But the memory concern is very real. To be put in the queue, the
>>>>> requests must be read from the network, so it doesn't take that many
>>>>> 2MB batches of inserts to occupy a lot of memory. Scans, on the other
>>>>> hand, become a memory concern in the threads, because that's where
>>>>> they materialize data in memory and, depending on the number of
>>>>> columns scanned and the kind of data that's read, it could be a lot.
>>>>> That's why the defaults aren't arbitrarily high; they're more on the
>>>>> safe side.
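>>>>>
>>>>> For example, with a queue length of 100, the queued write batches alone
>>>>> can pin roughly 100 x 2MB = 200MB on a tserver before any service
>>>>> thread even touches them.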
>>>>>
>>>>> Have you actually encountered performance issues that you could trace
>>>>> back to this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Feb 13, 2019 at 3:49 AM Boris <boris...@gmail.com> wrote:
>>>>>
>>>>>> But if we bump the thread count to 50, and the queue default is 50,
>>>>>> we should probably bump the queue to 100 or something like that,
>>>>>> right?
>>>>>>
>>>>>> On Wed, Feb 13, 2019, 00:54 Hao Hao <hao....@cloudera.com> wrote:
>>>>>>
>>>>>>> I don't see other flags that are relevant here; maybe others can
>>>>>>> chime in.
>>>>>>>
>>>>>>> For --rpc_service_queue_length, it configures the size of the RPC
>>>>>>> request queue. The queue helps buffer requests when a bunch of them
>>>>>>> arrive at once while the service threads are busy processing requests
>>>>>>> that have already arrived. But I don't see how it would help with
>>>>>>> handling more concurrent requests.
>>>>>>>
>>>>>>> Best,
>>>>>>> Hao
>>>>>>>
>>>>>>> On Tue, Feb 12, 2019 at 6:45 PM Boris <boris...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Hao, appreciate your response.
>>>>>>>>
>>>>>>>> Do we also need to bump other RPC thread-related parameters (the
>>>>>>>> queue, etc.)?
>>>>>>>>
>>>>>>>> On Tue, Feb 12, 2019, 21:09 Hao Hao <hao....@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Boris,
>>>>>>>>>
>>>>>>>>> Sorry for the delay. --rpc_num_service_threads sets the number of
>>>>>>>>> threads in the RPC service thread pool (the default is 20 for tablet
>>>>>>>>> servers and 10 for masters). It should help with processing
>>>>>>>>> concurrent incoming RPC requests, but increasing it beyond the
>>>>>>>>> number of available CPU cores on the machines may not bring much
>>>>>>>>> value.
>>>>>>>>>
>>>>>>>>> You don't need to set the same value for masters and tablet
>>>>>>>>> servers. Most of the time, tablet servers should see more RPCs,
>>>>>>>>> since that's where the scans and writes are taking place.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Hao
>>>>>>>>>
>>>>>>>>> On Tue, Feb 12, 2019 at 5:29 PM Boris Tyukin <
>>>>>>>>> bo...@boristyukin.com> wrote:
>>>>>>>>>
>>>>>>>>>> Can someone point us to documentation or explain what these
>>>>>>>>>> parameters really mean and how they should be set on a production
>>>>>>>>>> cluster? I would greatly appreciate it!
>>>>>>>>>>
>>>>>>>>>> Boris
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 8, 2019 at 3:40 PM Boris Tyukin <
>>>>>>>>>> bo...@boristyukin.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> we need to process 1000s of operations per second and noticed
>>>>>>>>>>> that our Kudu 1.5 cluster was only using 10 threads while our
>>>>>>>>>>> application spins up 50 clients/threads. We observed in the web UI
>>>>>>>>>>> that only 10 threads are working and the other 40 are waiting in
>>>>>>>>>>> the queue.
>>>>>>>>>>>
>>>>>>>>>>> We found the rpc_num_service_threads parameter in the
>>>>>>>>>>> configuration guide, but it is still not clear to me what exactly
>>>>>>>>>>> we need to adjust to allow Kudu to handle more concurrent
>>>>>>>>>>> operations.
>>>>>>>>>>>
>>>>>>>>>>> Do we bump this parameter below, or do we need to consider other
>>>>>>>>>>> RPC-related parameters?
>>>>>>>>>>>
>>>>>>>>>>> Also, do we need to use the same numbers for masters and tablet
>>>>>>>>>>> servers?
>>>>>>>>>>>
>>>>>>>>>>> Are there any good numbers to target based on CPU core count?
>>>>>>>>>>>
>>>>>>>>>>> --rpc_num_service_threads
>>>>>>>>>>> <https://kudu.apache.org/docs/configuration_reference.html#kudu-master_rpc_num_service_threads>
>>>>>>>>>>>
>>>>>>>>>>> Number of RPC worker threads to run
>>>>>>>>>>> Type: int32
>>>>>>>>>>> Default: 10
>>>>>>>>>>> Tags: advanced
>>>>>>>>>>>
>>>>>>>>>>

-- 
Grant Henke
Software Engineer | Cloudera
gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
