Ah, thanks. I'm trying it now. The problem is that it doesn't record userspace stack traces properly (it probably needs all dependencies to be recompiled with -fno-omit-frame-pointer :-/). So while I know that a lot of time is spent waiting on futexes, I don't know whether that waiting happens for a legitimate reason...
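
The recipe I'm following is roughly the one from that page: rebuild with frame pointers kept, then record off-CPU time with bcc's offcputime and fold it into a flame graph. This is only a sketch of the steps, assuming the bcc and FlameGraph tools are installed; the CMake invocation is generic and $BENCH_PID stands for the benchmark's process id:

# Rebuild Arrow (and ideally its dependencies) without omitting frame
# pointers, so that userspace stacks can be unwound:
$ cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" ..
$ make -j8

# Record 30 seconds of off-CPU time for the benchmark process (folded
# output, user + kernel stacks), then render an off-CPU flame graph:
$ sudo /usr/share/bcc/tools/offcputime -df -p $BENCH_PID 30 > out.stacks
$ ./flamegraph.pl --color=io --title="Off-CPU Time" --countname=us \
      < out.stacks > offcpu.svg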
Regards

Antoine.


On 21/02/2019 at 17:52, Hatem Helal wrote:
> I was thinking of this variant:
>
> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
>
> but I must admit that I haven't tried that technique myself.
>
>
> On 2/21/19, 4:41 PM, "Antoine Pitrou" <solip...@pitrou.net> wrote:
>
> I don't think that's the answer here.  The question is not how
> to /visualize/ where time is spent waiting, but how to /measure/ it.
> Normal profiling only tells you where CPU time is spent, not what the
> process is idly waiting for.
>
> Regards
>
> Antoine.
>
>
> On Thu, 21 Feb 2019 16:29:15 +0000
> Hatem Helal <hhe...@mathworks.com> wrote:
> > I like flamegraphs for investigating this sort of problem:
> >
> > https://github.com/brendangregg/FlameGraph
> >
> > There are likely many other techniques for inspecting where time is
> > being spent, but that can at least help narrow down the search space.
> >
> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques"
> > <fsaintjacq...@gmail.com> wrote:
> >
> > Can you remind us what's the easiest way to get Flight working with
> > gRPC?  clone + make install doesn't really work out of the box.
> >
> > François
> >
> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou
> > <anto...@python.org> wrote:
> > >
> > > Hello,
> > >
> > > I've been trying to saturate several CPU cores using our Flight
> > > benchmark (which spawns a server process and attempts to communicate
> > > with it using multiple clients), but haven't managed to.
> > >
> > > The typical command line I'm executing is the following:
> > >
> > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark
> > >     -records_per_stream 50000000 -num_streams 16 -num_threads 32
> > >     -records_per_batch 120000
> > >
> > > Breakdown:
> > >
> > > - "time": I want to get CPU user / system / wall-clock times.
> > >
> > > - "taskset -c ...": I have an 8-core, 16-thread machine and I want to
> > >   allow scheduling RPC threads on 4 distinct physical cores.
> > >
> > > - "-records_per_stream": I want each stream to have enough records so
> > >   that connection / stream setup costs are negligible.
> > >
> > > - "-num_streams": the number of streams the benchmark tries to
> > >   download (DoGet()) from the server to the client.
> > >
> > > - "-num_threads": the number of client threads the benchmark makes
> > >   download requests from.  Since our client is currently blocking, it
> > >   makes sense to have a large number of client threads (to allow
> > >   overlap).  Note that each thread creates a separate gRPC client and
> > >   connection.
> > >
> > > - "-records_per_batch": transfer enough records per individual RPC
> > >   message to minimize overhead.  This number brings us close to the
> > >   default gRPC message limit of 4 MB.
> > >
> > > The results I get look like:
> > >
> > > Bytes read: 25600000000
> > > Nanos: 8433804781
> > > Speed: 2894.79 MB/s
> > >
> > > real    0m8,569s
> > > user    0m6,085s
> > > sys     0m15,667s
> > >
> > > If we divide (user + sys) by real, we conclude that about 2.5 cores
> > > are saturated by this benchmark.  Evidently, this means that the
> > > benchmark is waiting a *lot*.  The question is: where?
> > >
> > > Here are some things I looked at:
> > >
> > > - Mutex usage inside Arrow.  None seems to pop up (printf is my
> > >   friend).
> > >
> > > - Number of threads used by the gRPC server.  gRPC implicitly spawns
> > >   a number of threads to handle incoming client requests.  I've
> > >   checked (using printf...) that several threads are indeed used to
> > >   serve incoming connections.
> > >
> > > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> > >   spent in memcpy() calls in the *client* (precisely, in the
> > >   grpc_byte_buffer_reader_readall() call inside
> > >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> > >   like the server is the bottleneck.
> > >
> > > - The benchmark connects to "localhost".  I've changed it to
> > >   "127.0.0.1"; it doesn't make a difference.  AFAIK, localhost TCP
> > >   connections should be well-optimized on Linux.  It seems highly
> > >   unlikely that they would incur idle waiting times (rather than CPU
> > >   time processing packets).
> > >
> > > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> > >   (server).  No swapping occurs.
> > >
> > > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> > >   benchmark.
> > >
> > > - As a reference, I can transfer 5 GB/s over a single TCP connection
> > >   using plain sockets in a simple Python script.  3 GB/s over
> > >   multiple connections doesn't look terrific.
> > >
> > > So it looks like there's a scalability issue inside our current
> > > Flight code, or perhaps inside gRPC.  The benchmark itself, while
> > > simplistic, doesn't look problematic; it should actually be kind of
> > > a best case, especially with the above parameters.
> > >
> > > Does anyone have any clues or ideas?  In particular, is there a
> > > simple way to diagnose *where* exactly the waiting times happen?
> > >
> > > Regards
> > >
> > > Antoine.
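
P.S.: for completeness, the "plain sockets in a simple Python script" baseline quoted above is the kind of thing measured by a loopback script along these lines.  This is only a sketch of such a measurement, not the exact script used; the chunk size and total volume are arbitrary choices:

# Rough loopback throughput baseline: one sender, one receiver, plain
# TCP sockets on 127.0.0.1.
import socket
import threading
import time

TOTAL_BYTES = 10 * 1024**3   # 10 GiB transferred in total
CHUNK = 1 << 20              # 1 MiB per send/recv call

def serve(listener):
    # Accept one connection and push TOTAL_BYTES of dummy data.
    conn, _ = listener.accept()
    buf = b"x" * CHUNK
    sent = 0
    while sent < TOTAL_BYTES:
        conn.sendall(buf)
        sent += CHUNK
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve, args=(listener,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
received = 0
start = time.monotonic()
while received < TOTAL_BYTES:
    data = client.recv(CHUNK)
    if not data:
        break
    received += len(data)
elapsed = time.monotonic() - start
print(f"{received / elapsed / 1e9:.2f} GB/s over a single TCP connection")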