Hi Boris,

There is a PutKudu NiFi processor and some work is being done to ensure it's complete.
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-kudu-bundle/nifi-kudu-processors/src/main/java/org/apache/nifi/processors/kudu

Are you using that? If not, maybe you could, or at least use it for inspiration.

Thanks,
Grant

On Thu, Feb 14, 2019 at 4:05 PM Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> One order of magnitude improvement is great news!
>
> There's something to learn here... Maybe the masters should be screaming
> if they're being slammed with tablet location requests (which I'm guessing
> was your case). Or should we have some recipes like "here's how you should
> write to Kudu from NiFi"? Any thoughts?
>
> In any case, thanks for reporting back!
>
> J-D
>
> On Thu, Feb 14, 2019 at 1:56 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> Hi J-D, we just solved our performance issue! Client reuse did the trick.
>> We knew it was something we had to do but, without going into details, it
>> was not quite an easy thing to do because we use NiFi to coordinate stuff
>> and could not find a way to reuse the Kudu client safely. Data comes in mini
>> batches (200 operations on average) and for every mini batch we would open
>> a new client, new session, then close. We also run them in the fastest mode
>> (AUTO_FLUSH_BACKGROUND) but if we have any errors, we do another pass in
>> AUTO_FLUSH_SYNC mode because of KUDU-2625, as we need to implement some
>> custom error handling logic (some errors in our case can be safely ignored
>> while others are not).
>>
>> Long story short, we found a way to reuse the client instance in NiFi while
>> still keeping the native concurrency benefits of NiFi, and our performance
>> improved by 10 times at least!
>>
>> Thanks for your help and ideas! It is quite a relief for us!
>> Boris
>>
>> On Thu, Feb 14, 2019 at 11:51 AM Jean-Daniel Cryans <jdcry...@apache.org>
>> wrote:
>>
>>> Hi Boris,
>>>
>>> Thank you for all those details. Some questions:
>>>
>>> - Is test 3 right for the tserver threads/queue? Or did you really just
>>> bump the master's threads/queue and it magically made it ~2.3 times faster?
>>> And then in test 5 you also bumped the tserver and it barely made it better?
>>> Unless I'm missing something big, this is very unexpected. Are you
>>> re-creating the Kudu client each time you send operations? Otherwise, the
>>> tablet locations should stay in cache and the masters would be completely
>>> out of the picture after the first few seconds.
>>>
>>> - Are you batching the writes? What kind of flush mode are you using?
>>>
>>> - Are there some server-side graphs you can share? I'd like to see
>>> inserts/second across the servers for the duration of the tests.
>>>
>>> - Can you share your table schema and partition schema? For the columns
>>> I'm mostly interested in the row keys and the cardinality of each column.
>>>
>>> Thanks,
>>>
>>> J-D
>>>
>>> On Thu, Feb 14, 2019 at 5:41 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Hi J-D and thanks for your comments.
>>>>
>>>> On a very high level, we subscribe to 750 Kafka topics and wrote a
>>>> custom app using the Java API to insert, update or delete into 250 Kudu
>>>> tables (messages from 3 topics are merged into a single Kudu table). Our
>>>> custom code can spawn any number of threads and we experimented with
>>>> 10, 20 and 50.
>>>>
>>>> When we did our first test on the development 6 node cluster
>>>> (high-density 2x22 core beast) for 5 tables/15 topics with 10 concurrent
>>>> threads - all was good and promising.
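For reference, a minimal sketch (not Boris's actual code) of the shared-client pattern described in the reply further up this thread, using the public Kudu Java client API: one KuduClient is built once and shared across worker threads, a short-lived KuduSession is opened per mini batch in AUTO_FLUSH_BACKGROUND mode, and pending row errors are checked after the flush. The master addresses, table name, column names, batch shape, and the "ignore already-present rows" rule are illustrative assumptions.

import java.util.List;
import java.util.Map;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.RowError;
import org.apache.kudu.client.SessionConfiguration;
import org.apache.kudu.client.Upsert;

public class SharedKuduWriter {

  // One KuduClient per JVM, shared by all worker threads; the client is thread-safe
  // and caches tablet locations, so the masters are only consulted occasionally.
  private static final KuduClient CLIENT =
      new KuduClient.KuduClientBuilder("kudu-master-1:7051,kudu-master-2:7051,kudu-master-3:7051")
          .build();

  // Called once per mini batch (e.g. per flowfile). Sessions are cheap but not
  // thread-safe, so each thread/batch gets its own.
  public void writeBatch(List<Map<String, Object>> rows) throws KuduException {
    KuduTable table = CLIENT.openTable("my_table"); // the table handle could be cached too
    KuduSession session = CLIENT.newSession();
    session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
    try {
      for (Map<String, Object> row : rows) {
        Upsert op = table.newUpsert();
        PartialRow pr = op.getRow();
        pr.addLong("id", (Long) row.get("id"));
        pr.addString("payload", (String) row.get("payload"));
        session.apply(op); // buffered; flushed in the background
      }
      session.flush();
      // In AUTO_FLUSH_BACKGROUND mode, row errors surface here rather than from apply().
      for (RowError error : session.getPendingErrors().getRowErrors()) {
        // Illustrative rule: tolerate duplicate-key errors, fail on anything else.
        if (!error.getErrorStatus().isAlreadyPresent()) {
          throw new RuntimeException("Kudu write failed: " + error);
        }
      }
    } finally {
      session.close();
    }
  }
}

In a NiFi flow, the same effect could be achieved by holding the client in a controller service or a static holder shared by the processor instances, so each flowfile batch only pays for a new session, not a new client.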
>>>> Once we've added all tables/topics, our process became very slow and
>>>> throughput dropped by 20-30 times. We increased the number of threads for
>>>> our custom code to 50, and this is when we noticed that Kudu uses only 10
>>>> threads and the other threads are waiting in the queue.
>>>>
>>>> Our target Kudu tables were empty when we started and the cluster was
>>>> pretty much idle.
>>>>
>>>> Here are the results of our benchmarks that you might find interesting.
>>>> "threads/queue" in the header is rpc_num_service_threads /
>>>> rpc_service_queue_length.
>>>>
>>>> We started with defaults, and tests 5 and 6 turned out to be the best.
>>>> Quite a dramatic difference compared to test 1.
>>>>
>>>> I should also mention, we are running this on our DEV 6 node cluster,
>>>> but it is pretty beefy (2x24 CPU cores, 256 GB of RAM, 12 disks, etc.) and
>>>> the cluster was not doing anything else but writing into Kudu.
>>>>
>>>> It is also interesting that test 7 did not give any further
>>>> improvements - our speculation here is that we just hit the limits of our
>>>> 6 node Kudu cluster, since it can only handle so many tablets at once and
>>>> we use a replication factor of 3.
>>>>
>>>> Another test we did: we created a simple app that would run selects on
>>>> these tables while the first app keeps writing into them. Throughput
>>>> dropped quite a bit as well with the defaults, but bumping the RPC
>>>> threads helped.
>>>>
>>>> If you have any other thoughts/observations, I would like to hear them!
>>>>
>>>> I think things like that should be somewhere in the Kudu docs, along
>>>> with a few important parameters that orgs new to Kudu must tweak. I can
>>>> write a blog post about it, but I am no Kudu dev so I do not want to
>>>> misrepresent anything.
>>>>
>>>> For example, we've learned the hard way to tweak these two parameters
>>>> right away, as insert performance was terrible out of the box:
>>>>
>>>> [screenshot of a Slack message from Will Berkeley: "@wangxg you
>>>> shouldn't need to tune too many parameters despite the large number of
>>>> available ones. Tune --memory_limit_hard_bytes to control the total
>>>> amount of memory Kudu will use. Tune --maintenance_manager_num_threads to
>>>> about 1/3 of the number of disks you are using for Kudu (assuming you are
>>>> using the latest version, 1.5)."]
>>>>
>>>> test   | custom app threads | master threads/queue | tserver threads/queue | total operations | duration (start to finish) | avg operations/sec | avg duration per flowfile | avg operations per flowfile
>>>> test 1 | 10                 | 10,50                | 10,50                 | 5,906,849        | 60 minutes                 | 98,447             | 404.389279 ms             | 66.8861423
>>>> test 2 | 30                 | 10,50                | 10,50                 | 3,107,938        | 32 minutes                 | 97,123             | 1611.71499 ms             | 102.416727
>>>> test 3 | 10                 | 30,100               | 10,50                 | 13,617,274       | 60 minutes                 | 226,954            | 448.954472 ms             | 196.878148
>>>> test 4 | 30                 | 30,100               | 10,50                 | 5,794,268        | 60 minutes                 | 96,571             | 2342.55094 ms             | 114.822107
>>>> test 5 | 10                 | 30,100               | 30,100                | 16,813,710       | 60 minutes                 | 280,228            | 391.113644 ms             | 183.887024
>>>> test 6 | 30                 | 30,100               | 30,100                | 15,903,303       | 60 minutes                 | 265,055            | 2300.38629 ms             | 341.316543
>>>> test 7 | 30                 | 50,200               | 50,200                | 12,549,114       | 60 minutes                 | 209,151            | 2364.45707 ms             | 276.851262
>>>>
>>>> On Wed, Feb 13, 2019 at 11:39 AM Jean-Daniel Cryans <jdcry...@apache.org>
>>>> wrote:
>>>>
>>>>> Some comments on the original problem: "we need to process 1000s of
>>>>> operations per second and noticed that our Kudu 1.5 cluster was only
>>>>> using 10 threads while our application spins up 50 clients/threads"
>>>>>
>>>>> I wouldn't directly infer that 20 threads won't be enough to match
>>>>> your needs. The time it takes to service a request can vary greatly: a
>>>>> single thread could process 500 operations that take 2ms to run, or 2
>>>>> that take 500ms to run, and you have 20 of those. The queue is there to
>>>>> make sure that the threads are kept busy instead of bouncing the clients
>>>>> back the moment all the threads are occupied. Your 50 threads can't
>>>>> constantly pound all the tservers; there's time spent on the network and
>>>>> whatever processing needs to happen client-side before they go back to
>>>>> Kudu.
>>>>>
>>>>> TBH there's not a whole lot of science around how we set those two
>>>>> defaults (# of threads and queue size), but it's very workload-dependent.
>>>>> Ideally the tservers would just right-size the pools based on the kind
>>>>> of requests that are coming in and the amount of memory they can use. I
>>>>> guess CPU also comes into the picture, but again it depends on the
>>>>> workload; Kudu stores data so it tends to be more IO-bound than
>>>>> CPU-bound.
>>>>>
>>>>> But the memory concern is very real. To be put in the queue, the
>>>>> requests must be read from the network, so it doesn't take that many 2MB
>>>>> batches of inserts to occupy a lot of memory. Scans, on the other hand,
>>>>> become a memory concern in the threads because that's where they
>>>>> materialize data in memory and, depending on the number of columns
>>>>> scanned and the kind of data that's read, it could be a lot. That's why
>>>>> the defaults aren't arbitrarily high; they're more on the safe side.
>>>>>
>>>>> Have you actually encountered performance issues that you could trace
>>>>> back to this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Feb 13, 2019 at 3:49 AM Boris <boris...@gmail.com> wrote:
>>>>>
>>>>>> But if we bump the thread count to 50, and the queue default is 50, we
>>>>>> should probably bump the queue to 100 or something like that, right?
>>>>>>
>>>>>> On Wed, Feb 13, 2019, 00:54 Hao Hao <hao....@cloudera.com> wrote:
>>>>>>
>>>>>>> I don't see other flags that are relevant here; maybe others can
>>>>>>> chime in.
>>>>>>>
>>>>>>> For --rpc_service_queue_length, it configures the size of the RPC
>>>>>>> request queues. The queue helps to buffer requests in case a bunch of
>>>>>>> them come at once and the service threads are too busy processing
>>>>>>> requests that have already arrived. But I don't see how it can help
>>>>>>> with handling more concurrent requests.
>>>>>>>
>>>>>>> Best,
>>>>>>> Hao
>>>>>>>
>>>>>>> On Tue, Feb 12, 2019 at 6:45 PM Boris <boris...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Hao, appreciate your response.
>>>>>>>>
>>>>>>>> Do we also need to bump other RPC thread-related parameters (queue,
>>>>>>>> etc.)?
>>>>>>>>
>>>>>>>> On Tue, Feb 12, 2019, 21:09 Hao Hao <hao....@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Boris,
>>>>>>>>>
>>>>>>>>> Sorry for the delay. --rpc_num_service_threads sets the number of
>>>>>>>>> threads in the RPC service thread pool (the default is 20 for tablet
>>>>>>>>> servers, 10 for masters). It should help with processing concurrent
>>>>>>>>> incoming RPC requests, but increasing it beyond the number of
>>>>>>>>> available CPU cores on the machines may not bring much value.
>>>>>>>>>
>>>>>>>>> You don't need to set the same value for masters and tablet
>>>>>>>>> servers. Most of the time, tablet servers should have more RPCs,
>>>>>>>>> since that is where the scans and writes are taking place.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Hao
>>>>>>>>>
>>>>>>>>> On Tue, Feb 12, 2019 at 5:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Can someone point us to documentation or explain what these
>>>>>>>>>> parameters really mean or how they should be set on a production
>>>>>>>>>> cluster? I will greatly appreciate it!
>>>>>>>>>>
>>>>>>>>>> Boris
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 8, 2019 at 3:40 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> We need to process 1000s of operations per second and noticed
>>>>>>>>>>> that our Kudu 1.5 cluster was only using 10 threads while our
>>>>>>>>>>> application spins up 50 clients/threads. We observed in the web UI
>>>>>>>>>>> that only 10 threads are working and the other 40 are waiting in
>>>>>>>>>>> the queue.
>>>>>>>>>>>
>>>>>>>>>>> We found the rpc_num_service_threads parameter in the
>>>>>>>>>>> configuration guide, but it is still not clear to me what exactly
>>>>>>>>>>> we need to adjust to allow Kudu to handle more concurrent
>>>>>>>>>>> operations.
>>>>>>>>>>>
>>>>>>>>>>> Do we bump the parameter below, or do we need to consider other
>>>>>>>>>>> RPC-related parameters?
>>>>>>>>>>>
>>>>>>>>>>> Also, do we need to use the same numbers for masters and tablet
>>>>>>>>>>> servers?
>>>>>>>>>>>
>>>>>>>>>>> Are there any good numbers to target based on CPU core count?
>>>>>>>>>>>
>>>>>>>>>>> --rpc_num_service_threads
>>>>>>>>>>> <https://kudu.apache.org/docs/configuration_reference.html#kudu-master_rpc_num_service_threads>
>>>>>>>>>>>
>>>>>>>>>>> Number of RPC worker threads to run
>>>>>>>>>>> Type: int32
>>>>>>>>>>> Default: 10
>>>>>>>>>>> Tags: advanced
>>>>>>>>>>
--
Grant Henke
Software Engineer | Cloudera
gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
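To pull the tuning knobs mentioned in this thread into one place, here is a sketch of what a tablet server gflagfile might look like. The values are purely illustrative: the RPC numbers mirror test 5 above, the maintenance threads follow the "about 1/3 of the disks" advice for a 12-disk node, and the memory limit is an arbitrary 100 GiB example. The right numbers depend on the workload and hardware, as J-D notes above.

# --flagfile for kudu-tserver (illustrative values only)

# RPC service threads and queue, per the tests above
--rpc_num_service_threads=30
--rpc_service_queue_length=100

# From the advice quoted above: roughly 1/3 of the data disks (12 disks here)
--maintenance_manager_num_threads=4

# Cap on the total memory Kudu may use (100 GiB in this example)
--memory_limit_hard_bytes=107374182400

The masters can be tuned separately; the same two RPC flags exist there, and test 5 above used 30 threads and a queue of 100 on the masters as well.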