I like flamegraphs for investigating this sort of problem:

https://github.com/brendangregg/FlameGraph

There are likely many other techniques for inspecting where time is being spent 
but that can at least help narrow down the search space.


On 2/21/19, 4:29 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:

    Hi Francois,
    
    It *should* work out of the box. I spent some time to make sure it does.
    Can you open a JIRA?
    
    I recommend using the grpc-cpp conda-forge package.
    
    Wes
    
    On Thu, Feb 21, 2019, 11:03 AM Francois Saint-Jacques <
    fsaintjacq...@gmail.com> wrote:
    
    > Can you remind us what's the easiest way to get flight working with grpc?
    > clone + make install doesn't really work out of the box.
    >
    > François
    >
    > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou <anto...@python.org>
    > wrote:
    >
    > >
    > > Hello,
    > >
    > > I've been trying to saturate several CPU cores using our Flight
    > > benchmark (which spawns a server process and attempts to communicate
    > > with it using multiple clients), but haven't managed to.
    > >
    > > The typical command-line I'm executing is the following:
    > >
    > > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
    > > -records_per_stream 50000000 -num_streams 16 -num_threads 32
    > > -records_per_batch 120000
    > >
    > > Breakdown:
    > >
    > > - "time": I want to get CPU user / system / wall-clock times
    > >
    > > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
    > >   allow scheduling RPC threads on 4 distinct physical cores
    > >
    > > - "-records_per_stream": I want each stream to have enough records so
    > >   that connection / stream setup costs are negligible
    > >
    > > - "-num_streams": this is the number of streams the benchmark tries to
    > >   download (DoGet()) from the server to the client
    > >
    > > - "-num_threads": this is the number of client threads the benchmark
    > >   makes download requests from.  Since our client is currently
    > >   blocking, it makes sense to have a large number of client threads (to
    > >   allow overlap).  Note that each thread creates a separate gRPC client
    > >   and connection.
    > >
    > > - "-records_per_batch": transfer enough records per individual RPC
    > >   message, to minimize overhead.  This number brings us close to the
    > >   default gRPC message limit of 4 MB.
    > >
    > > The results I get look like:
    > >
    > > Bytes read: 25600000000
    > > Nanos: 8433804781
    > > Speed: 2894.79 MB/s
    > >
    > > real    0m8,569s
    > > user    0m6,085s
    > > sys     0m15,667s
    > >
    > >
    > > If we divide (user + sys) by real, we conclude that 2.5 cores are
    > > saturated by this benchmark.  Evidently, this means that the benchmark
    > > is waiting a *lot*.  The question is: where?
    > >
    > > Here is some things I looked at:
    > >
    > > - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
    > >
    > > - number of threads used by the gRPC server.  gRPC implicitly spawns a
    > >   number of threads to handle incoming client requests.  I've checked
    > >   (using printf...) that several threads are indeed used to serve
    > >   incoming connections.
    > >
    > > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
    > >   spent in memcpy() calls in the *client* (precisely, in the
    > >   grpc_byte_buffer_reader_readall() call inside
    > >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
    > >   like the server is the bottleneck.
    > >
    > > - the benchmark connects to "localhost".  I've changed it to
    > >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
    > >   connections should be well-optimized on Linux.  It seems highly
    > >   unlikely that they would incur idle waiting times (rather than CPU
    > >   time processing packets).
    > >
    > > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
    > >   (server).  No swapping occurs.
    > >
    > > - Disk I/O.  "vmstat" tells me no block I/O happens during the
    > >   benchmark.
    > >
    > > - As a reference, I can transfer 5 GB/s over a single TCP connection
    > >   using plain sockets in a simple Python script.  3 GB/s over multiple
    > >   connections doesn't look terrific.
    > >
    > >
    > > So it looks like there's a scalability issue inside our current Flight
    > > code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
    > > doesn't look problematic; it should actually be kind of a best case,
    > > especially with the above parameters.
    > >
    > > Does anyone have any clues or ideas?  In particular, is there a simple
    > > way to diagnose *where* exactly the waiting times happen?
    > >
    > > Regards
    > >
    > > Antoine.
    > >
    >
    

Reply via email to