If that were the case, then we would see 100% CPU usage on all CPU cores, right? My question is why only about 2.5 cores' worth of CPU time is used while I'm pinning the benchmark to 4 physical cores.
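For concreteness, the kind of off-CPU measurement I have in mind is
something like this (a sketch; the BCC tool path varies by distro, and
as discussed below, the user-space stacks come out truncated unless
dependencies are rebuilt with -fno-omit-frame-pointer):

# Sample where threads sit blocked (off-CPU) for 10 s, folded output:
$ sudo /usr/share/bcc/tools/offcputime -f -p $(pgrep -f arrow-flight-benchmark) 10 > stacks.folded
# Render an off-CPU flame graph from the folded stacks:
$ flamegraph.pl --color=io --countname=us < stacks.folded > offcpu.svg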
Regards

Antoine.


On 24/02/2019 14:29, Francois Saint-Jacques wrote:
> A quick glance suggests you're limited by the kernel copying memory around
> (https://gist.github.com/fsaintjacques/1fa00c8e50a09325960d8dc7463c497e).
> I think the next step is to use different physical hosts for client and
> server. This way you'll free resources for the server.
>
> François
>
>
> On Thu, Feb 21, 2019 at 12:42 PM Antoine Pitrou <anto...@python.org> wrote:
>
>> We're talking about the BCC tools, which are not based on perf:
>> https://github.com/iovisor/bcc/
>>
>> Apparently, using Linux perf for the same purpose is some kind of hassle
>> (you need to write perf scripts?).
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 21/02/2019 18:40, Francois Saint-Jacques wrote:
>>> You can compile with DWARF debug info (-g/-ggdb) and pass
>>> `--call-graph=dwarf` to perf; it'll help the unwinding. Sometimes it's
>>> better than the frame-pointer method, since it keeps track of inlined
>>> functions.
>>>
>>> On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou <anto...@python.org> wrote:
>>>
>>>> Ah, thanks. I'm trying it now. The problem is that it doesn't record
>>>> userspace stack traces properly (it probably needs all dependencies to
>>>> be recompiled with -fno-omit-frame-pointer :-/). So while I know that a
>>>> lot of time is spent waiting on futexes, I don't know whether that is
>>>> for a legitimate reason...
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> On 21/02/2019 17:52, Hatem Helal wrote:
>>>>> I was thinking of this variant:
>>>>>
>>>>> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
>>>>>
>>>>> but I must admit that I haven't tried that technique myself.
>>>>>
>>>>>
>>>>> On 2/21/19, 4:41 PM, "Antoine Pitrou" <solip...@pitrou.net> wrote:
>>>>>
>>>>> I don't think that's the answer here. The question is not how
>>>>> to /visualize/ where time is spent waiting, but how to /measure/ it.
>>>>> Normal profiling only tells you where CPU time is spent, not what
>>>>> the process is idly waiting for.
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>>
>>>>> On Thu, 21 Feb 2019 16:29:15 +0000
>>>>> Hatem Helal <hhe...@mathworks.com> wrote:
>>>>> > I like flame graphs for investigating this sort of problem:
>>>>> >
>>>>> > https://github.com/brendangregg/FlameGraph
>>>>> >
>>>>> > There are likely many other techniques for inspecting where time
>>>>> > is being spent, but that can at least help narrow down the search
>>>>> > space.
>>>>> >
>>>>> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques"
>>>>> > <fsaintjacq...@gmail.com> wrote:
>>>>> >
>>>>> > Can you remind us what's the easiest way to get Flight working
>>>>> > with gRPC? clone + make install doesn't really work out of the box.
>>>>> >
>>>>> > François
>>>>> >
>>>>> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou
>>>>> > <anto...@python.org> wrote:
>>>>> >
>>>>> > > Hello,
>>>>> > >
>>>>> > > I've been trying to saturate several CPU cores using our Flight
>>>>> > > benchmark (which spawns a server process and attempts to
>>>>> > > communicate with it using multiple clients), but haven't managed
>>>>> > > to.
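>>>>> > >
>>>>> > > (To quantify "saturate": while the benchmark runs, a quick
>>>>> > > per-thread check is, as a sketch,
>>>>> > >
>>>>> > > $ pidstat -t -p $(pgrep -f arrow-flight-benchmark) 1
>>>>> > >
>>>>> > > where the per-thread %CPU values should sum to roughly 400% if
>>>>> > > all 4 pinned cores were busy.)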
>>>>> > >
>>>>> > > The typical command-line I'm executing is the following:
>>>>> > >
>>>>> > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark
>>>>> > >     -records_per_stream 50000000 -num_streams 16 -num_threads 32
>>>>> > >     -records_per_batch 120000
>>>>> > >
>>>>> > > Breakdown:
>>>>> > >
>>>>> > > - "time": I want to get CPU user / system / wall-clock times
>>>>> > >
>>>>> > > - "taskset -c ...": I have an 8-core, 16-thread machine and I
>>>>> > >   want to allow scheduling RPC threads on 4 distinct physical
>>>>> > >   cores
>>>>> > >
>>>>> > > - "-records_per_stream": I want each stream to have enough
>>>>> > >   records so that connection / stream setup costs are negligible
>>>>> > >
>>>>> > > - "-num_streams": this is the number of streams the benchmark
>>>>> > >   tries to download (DoGet()) from the server to the client
>>>>> > >
>>>>> > > - "-num_threads": this is the number of client threads from which
>>>>> > >   the benchmark makes download requests. Since our client is
>>>>> > >   currently blocking, it makes sense to have a large number of
>>>>> > >   client threads (to allow overlap). Note that each thread
>>>>> > >   creates a separate gRPC client and connection.
>>>>> > >
>>>>> > > - "-records_per_batch": transfer enough records per individual
>>>>> > >   RPC message to minimize overhead. This number brings us close
>>>>> > >   to the default gRPC message limit of 4 MB.
>>>>> > >
>>>>> > > The results I get look like:
>>>>> > >
>>>>> > > Bytes read: 25600000000
>>>>> > > Nanos: 8433804781
>>>>> > > Speed: 2894.79 MB/s
>>>>> > >
>>>>> > > real    0m8,569s
>>>>> > > user    0m6,085s
>>>>> > > sys     0m15,667s
>>>>> > >
>>>>> > > If we divide (user + sys) by real, i.e. (6.085 + 15.667) / 8.569,
>>>>> > > we conclude that about 2.5 cores are saturated by this benchmark.
>>>>> > > Evidently, this means that the benchmark is waiting a *lot*. The
>>>>> > > question is: where?
>>>>> > >
>>>>> > > Here are some things I looked at:
>>>>> > >
>>>>> > > - mutex usage inside Arrow. None seems to pop up (printf is my
>>>>> > >   friend).
>>>>> > >
>>>>> > > - number of threads used by the gRPC server. gRPC implicitly
>>>>> > >   spawns a number of threads to handle incoming client requests.
>>>>> > >   I've checked (using printf...) that several threads are indeed
>>>>> > >   used to serve incoming connections.
>>>>> > >
>>>>> > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time
>>>>> > >   is spent in memcpy() calls in the *client* (precisely, in the
>>>>> > >   grpc_byte_buffer_reader_readall() call inside
>>>>> > >   arrow::flight::internal::FlightDataDeserialize()). It doesn't
>>>>> > >   look like the server is the bottleneck.
>>>>> > >
>>>>> > > - the benchmark connects to "localhost". I've changed it to
>>>>> > >   "127.0.0.1"; it doesn't make a difference. AFAIK, localhost TCP
>>>>> > >   connections should be well-optimized on Linux. It seems highly
>>>>> > >   unlikely that they would incur idle waiting times (rather than
>>>>> > >   CPU time processing packets).
>>>>> > >
>>>>> > > - RAM usage. It's quite reasonable at 220 MB (client) + 75 MB
>>>>> > >   (server). No swapping occurs.
>>>>> > >
>>>>> > > - Disk I/O. "vmstat" tells me no block I/O happens during the
>>>>> > >   benchmark.
>>>>> > >
>>>>> > > - As a reference, I can transfer 5 GB/s over a single TCP
>>>>> > >   connection using plain sockets in a simple Python script (a
>>>>> > >   sketch of such a script is below). 3 GB/s over multiple
>>>>> > >   connections doesn't look terrific.
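>>>>> > >
>>>>> > > A minimal sketch of the kind of plain-socket baseline I mean (not
>>>>> > > the exact script; the chunk size, total size and port are
>>>>> > > illustrative):
>>>>> > >
>>>>> > > import socket, threading, time
>>>>> > >
>>>>> > > CHUNK = 1 << 20         # 1 MiB per send/recv
>>>>> > > TOTAL = 10 * (1 << 30)  # transfer 10 GiB in total
>>>>> > >
>>>>> > > def serve(port):
>>>>> > >     # Stream TOTAL bytes to the first client that connects.
>>>>> > >     srv = socket.socket()
>>>>> > >     srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>>>>> > >     srv.bind(("127.0.0.1", port))
>>>>> > >     srv.listen(1)
>>>>> > >     conn, _ = srv.accept()
>>>>> > >     buf = b"\0" * CHUNK
>>>>> > >     sent = 0
>>>>> > >     while sent < TOTAL:
>>>>> > >         conn.sendall(buf)
>>>>> > >         sent += CHUNK
>>>>> > >     conn.close()
>>>>> > >
>>>>> > > threading.Thread(target=serve, args=(9999,), daemon=True).start()
>>>>> > > time.sleep(0.5)  # crude wait for the server to start listening
>>>>> > > sock = socket.create_connection(("127.0.0.1", 9999))
>>>>> > > received, start = 0, time.time()
>>>>> > > while received < TOTAL:
>>>>> > >     data = sock.recv(CHUNK)
>>>>> > >     if not data:
>>>>> > >         break
>>>>> > >     received += len(data)
>>>>> > > print("%.0f MB/s" % (received / (time.time() - start) / 2**20))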
>>>>> > >
>>>>> > > So it looks like there's a scalability issue inside our current
>>>>> > > Flight code, or perhaps inside gRPC. The benchmark itself, while
>>>>> > > simplistic, doesn't look problematic; it should actually be kind
>>>>> > > of a best case, especially with the above parameters.
>>>>> > >
>>>>> > > Does anyone have any clues or ideas? In particular, is there a
>>>>> > > simple way to diagnose *where* exactly the waiting times happen?
>>>>> > >
>>>>> > > Regards
>>>>> > >
>>>>> > > Antoine.
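P.S.: following up on the `--call-graph=dwarf` suggestion quoted above,
the kind of invocation I have in mind is (a sketch; it assumes the Arrow
and gRPC libraries are built with -g so that DWARF unwinding has
something to work with):

$ perf record --call-graph=dwarf -p $(pgrep -f arrow-flight-benchmark) -- sleep 10
$ perf report --no-children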