Ah, thanks. I'm trying it now. The problem is that it doesn't record userspace stack traces properly (it probably needs all dependencies to be recompiled with -fno-omit-frame-pointer :-/). So while I know that a lot of time is spent waiting on futexes, I don't know whether that waiting happens for a legitimate reason...
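
The recipe I'm following is roughly the one from that page: rebuild with frame pointers kept, then record off-CPU time with bcc's offcputime and fold it into a flame graph. This is only a sketch of the steps, assuming the bcc and FlameGraph tools are installed; the CMake invocation is generic and $BENCH_PID stands for the benchmark's process id:

# Rebuild Arrow (and ideally its dependencies) without omitting frame
# pointers, so that userspace stacks can be unwound:
$ cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" ..
$ make -j8

# Record 30 seconds of off-CPU time for the benchmark process (folded
# output, user + kernel stacks), then render an off-CPU flame graph:
$ sudo /usr/share/bcc/tools/offcputime -df -p $BENCH_PID 30 > out.stacks
$ ./flamegraph.pl --color=io --title="Off-CPU Time" --countname=us \
      < out.stacks > offcpu.svg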
Regards

Antoine.


On 21/02/2019 at 17:52, Hatem Helal wrote:
> I was thinking of this variant:
>
> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
>
> but I must admit that I haven't tried that technique myself.
>
>
> On 2/21/19, 4:41 PM, "Antoine Pitrou" <solip...@pitrou.net> wrote:
>
> I don't think that's the answer here.  The question is not how
> to /visualize/ where time is spent waiting, but how to /measure/ it.
> Normal profiling only tells you where CPU time is spent, not what the
> process is idly waiting for.
>
> Regards
>
> Antoine.
>
>
> On Thu, 21 Feb 2019 16:29:15 +0000
> Hatem Helal <hhe...@mathworks.com> wrote:
> > I like flamegraphs for investigating this sort of problem:
> >
> > https://github.com/brendangregg/FlameGraph
> >
> > There are likely many other techniques for inspecting where time is
> > being spent, but that can at least help narrow down the search space.
> >
> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques"
> > <fsaintjacq...@gmail.com> wrote:
> >
> > Can you remind us what's the easiest way to get Flight working with
> > gRPC?  clone + make install doesn't really work out of the box.
> >
> > François
> >
> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou
> > <anto...@python.org> wrote:
> > >
> > > Hello,
> > >
> > > I've been trying to saturate several CPU cores using our Flight
> > > benchmark (which spawns a server process and attempts to communicate
> > > with it using multiple clients), but haven't managed to.
> > >
> > > The typical command line I'm executing is the following:
> > >
> > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark
> > >     -records_per_stream 50000000 -num_streams 16 -num_threads 32
> > >     -records_per_batch 120000
> > >
> > > Breakdown:
> > >
> > > - "time": I want to get CPU user / system / wall-clock times.
> > >
> > > - "taskset -c ...": I have an 8-core, 16-thread machine and I want to
> > >   allow scheduling RPC threads on 4 distinct physical cores.
> > >
> > > - "-records_per_stream": I want each stream to have enough records so
> > >   that connection / stream setup costs are negligible.
> > >
> > > - "-num_streams": the number of streams the benchmark tries to
> > >   download (DoGet()) from the server to the client.
> > >
> > > - "-num_threads": the number of client threads the benchmark makes
> > >   download requests from.  Since our client is currently blocking, it
> > >   makes sense to have a large number of client threads (to allow
> > >   overlap).  Note that each thread creates a separate gRPC client and
> > >   connection.
> > >
> > > - "-records_per_batch": transfer enough records per individual RPC
> > >   message to minimize overhead.  This number brings us close to the
> > >   default gRPC message limit of 4 MB.
> > >
> > > The results I get look like:
> > >
> > > Bytes read: 25600000000
> > > Nanos: 8433804781
> > > Speed: 2894.79 MB/s
> > >
> > > real    0m8,569s
> > > user    0m6,085s
> > > sys     0m15,667s
> > >
> > > If we divide (user + sys) by real, we conclude that about 2.5 cores
> > > are saturated by this benchmark.  Evidently, this means that the
> > > benchmark is waiting a *lot*.  The question is: where?
> > >
> > > Here are some things I looked at:
> > >
> > > - Mutex usage inside Arrow.  None seems to pop up (printf is my
> > >   friend).
> > >
> > > - Number of threads used by the gRPC server.  gRPC implicitly spawns
> > >   a number of threads to handle incoming client requests.  I've
> > >   checked (using printf...) that several threads are indeed used to
> > >   serve incoming connections.
> > >
> > > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> > >   spent in memcpy() calls in the *client* (precisely, in the
> > >   grpc_byte_buffer_reader_readall() call inside
> > >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> > >   like the server is the bottleneck.
> > >
> > > - The benchmark connects to "localhost".  I've changed it to
> > >   "127.0.0.1"; it doesn't make a difference.  AFAIK, localhost TCP
> > >   connections should be well-optimized on Linux.  It seems highly
> > >   unlikely that they would incur idle waiting times (rather than CPU
> > >   time processing packets).
> > >
> > > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> > >   (server).  No swapping occurs.
> > >
> > > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> > >   benchmark.
> > >
> > > - As a reference, I can transfer 5 GB/s over a single TCP connection
> > >   using plain sockets in a simple Python script.  3 GB/s over
> > >   multiple connections doesn't look terrific.
> > >
> > > So it looks like there's a scalability issue inside our current
> > > Flight code, or perhaps inside gRPC.  The benchmark itself, while
> > > simplistic, doesn't look problematic; it should actually be kind of
> > > a best case, especially with the above parameters.
> > >
> > > Does anyone have any clues or ideas?  In particular, is there a
> > > simple way to diagnose *where* exactly the waiting times happen?
> > >
> > > Regards
> > >
> > > Antoine.
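
P.S.: for completeness, the "plain sockets in a simple Python script" baseline quoted above is the kind of thing measured by a loopback script along these lines.  This is only a sketch of such a measurement, not the exact script used; the chunk size and total volume are arbitrary choices:

# Rough loopback throughput baseline: one sender, one receiver, plain
# TCP sockets on 127.0.0.1.
import socket
import threading
import time

TOTAL_BYTES = 10 * 1024**3   # 10 GiB transferred in total
CHUNK = 1 << 20              # 1 MiB per send/recv call

def serve(listener):
    # Accept one connection and push TOTAL_BYTES of dummy data.
    conn, _ = listener.accept()
    buf = b"x" * CHUNK
    sent = 0
    while sent < TOTAL_BYTES:
        conn.sendall(buf)
        sent += CHUNK
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve, args=(listener,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
received = 0
start = time.monotonic()
while received < TOTAL_BYTES:
    data = client.recv(CHUNK)
    if not data:
        break
    received += len(data)
elapsed = time.monotonic() - start
print(f"{received / elapsed / 1e9:.2f} GB/s over a single TCP connection")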