Re: Flight / gRPC scalability issue
It seems like this discussion would be relevant to the gRPC community. There are probably other issues at play, such as ensuring that multiple streams through the same port do not block each other too much when one stream carries smaller messages and another larger ones, in which case the byte slices sent are broken up into smaller pieces. We may want to make some improvements to gRPC, or configure it differently, to better suit our performance requirements. On Sun, Feb 24, 2019 at 1:19 PM Antoine Pitrou wrote: > > > On 24/02/2019 at 19:46, Wes McKinney wrote: > > OK, I don't know enough about sockets or networking to know what > > hypothetical performance is possible with 16 concurrent packet streams > > going through a single port (was the 5 GB/s based on a single-threaded > > or multithreaded benchmark? i.e. did it simulate the equivalent > > number / size / concurrency of packets that the Flight benchmark is > > doing?). > > The 5 GB/s uses just two threads: one server thread, one client thread. > The code is almost trivial, it's just careful to avoid spurious copies: > https://gist.github.com/pitrou/86fd433a8f71b0052e29fddcf4d766be > > Regards > > Antoine.
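As a rough sketch of what "configuring gRPC" could look like on the client side (not something the Flight API exposed at the time of this thread), gRPC channel arguments can be set when the channel is created. The helper name MakeTunedChannel and the particular knobs and values below (message size limits, a larger HTTP/2 frame size) are illustrative assumptions, not tested recommendations.

#include <climits>
#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

// Sketch: build a client channel with tuned channel arguments. The chosen
// knobs and values are illustrative, not measured recommendations.
std::shared_ptr<grpc::Channel> MakeTunedChannel(const std::string& target) {
  grpc::ChannelArguments args;
  // Lift the default 4 MB receive limit so a whole record batch fits in a
  // single message (INT_MAX effectively means "unlimited").
  args.SetMaxReceiveMessageSize(INT_MAX);
  args.SetMaxSendMessageSize(INT_MAX);
  // Larger HTTP/2 frames mean fewer, bigger slices per message (assumed to
  // matter here; 1 MiB is an arbitrary example value).
  args.SetInt(GRPC_ARG_HTTP2_MAX_FRAME_SIZE, 1 << 20);
  return grpc::CreateCustomChannel(target, grpc::InsecureChannelCredentials(),
                                   args);
}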
Re: Flight / gRPC scalability issue
Le 24/02/2019 à 19:46, Wes McKinney a écrit : > OK, I don't know enough about sockets or networking to know what > hypothetical performance is possible with 16 concurrent packet streams > going through a single port (was the 5GB/s based on a single-threaded > or multithreaded benchmark? i.e. did it simulate the the equivalent > number / size / concurrency of packets that the Flight benchmark is > doing). The 5 GB/s uses just two threads: one server thread, one client thread. The code is almost trivial, it's just careful to avoid spurious copies: https://gist.github.com/pitrou/86fd433a8f71b0052e29fddcf4d766be Regards Antoine.
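(The gist above is a short Python script.) For readers who want a C++ baseline to compare against the Flight numbers, a roughly equivalent loopback throughput test might look like the sketch below: one server thread repeatedly writes a reused 1 MiB buffer over a localhost TCP connection, and the main thread reads it back into another reused buffer, so almost all per-byte work is the kernel copy. The port number, transfer size, and the 100 ms start-up sleep are arbitrary choices; this is not the code behind the 5 GB/s figure.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  constexpr int kPort = 31337;               // arbitrary port for the test
  constexpr size_t kChunk = 1 << 20;         // 1 MiB writes, roughly RPC-message sized
  constexpr size_t kTotal = size_t(8) << 30; // 8 GiB transferred in total

  std::thread server([&] {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(kPort);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listener, 1);
    int conn = accept(listener, nullptr, nullptr);
    std::vector<char> buf(kChunk);
    size_t sent = 0;
    while (sent < kTotal) {
      ssize_t n = write(conn, buf.data(), kChunk);  // reuse the same buffer
      if (n <= 0) break;
      sent += static_cast<size_t>(n);
    }
    close(conn);
    close(listener);
  });

  // Give the server thread a moment to start listening (sketch-level sync).
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  int conn = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(kPort);
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  connect(conn, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

  std::vector<char> buf(kChunk);
  size_t received = 0;
  auto start = std::chrono::steady_clock::now();
  while (received < kTotal) {
    ssize_t n = read(conn, buf.data(), kChunk);  // read into the same buffer
    if (n <= 0) break;
    received += static_cast<size_t>(n);
  }
  double elapsed = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - start).count();
  std::printf("%.2f MB/s\n", received / elapsed / 1e6);

  close(conn);
  server.join();
  return 0;
}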
Re: Flight / gRPC scalability issue
OK, I don't know enough about sockets or networking to know what hypothetical performance is possible with 16 concurrent packet streams going through a single port (was the 5 GB/s based on a single-threaded or multithreaded benchmark? i.e. did it simulate the equivalent number / size / concurrency of packets that the Flight benchmark is doing?). If the CPU cores aren't being saturated then I guess IO is blocking in some way. It might be best to involve folks from the gRPC community who are more expert in this domain. To me, more than 20 Gbit/s seems like acceptable throughput, considering that networking faster than 10 gigabit is relatively exotic. I don't think that optimizing for a > 10 Gbit network was even a short-term goal for Flight. For faster networks I would guess we're going to be getting into RDMA for moving IPC payloads rather than going through TCP. On Sun, Feb 24, 2019 at 12:23 PM Antoine Pitrou wrote: > > > On 24/02/2019 at 18:35, Wes McKinney wrote: > > hi Antoine, > > > > All of the Flight traffic is going through a hard-coded single port > > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight-benchmark.cc#L185 > > > > What happens if you spin up a different server (and use a different > > port) for each thread? I'm surprised no one else has mentioned this > > yet. > > Well, that's not the expected usage model for a server, either :-) If > you run an HTTP server, for example, you don't expect to have to open > different ports on the same machine (rather than only port 80 or 443) to > get good scalability. > > Regards > > Antoine.
Re: Flight / gRPC scalability issue
Le 24/02/2019 à 18:35, Wes McKinney a écrit : > hi Antoine, > > All of the Flight traffic is going through a hard-coded single port > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight-benchmark.cc#L185 > > What happens if you spin up a different server (and use a different > port) for each thread? I'm surprised no one else has mentioned this > yet Well that's not the expected usage model for a server, either :-) If you run an HTTP server, for example, you don't expect to have to open different ports on the same machine (rather than only port 80 or 443) to get good scalability. Regards Antoine.
Re: Flight / gRPC scalability issue
hi Antoine, All of the Flight traffic is going through a hard-coded single port https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight-benchmark.cc#L185 What happens if you spin up a different server (and use a different port) for each thread? I'm surprised no one else has mentioned this yet https://issues.apache.org/jira/browse/ARROW-3330 - Wes On Sun, Feb 24, 2019 at 9:20 AM Antoine Pitrou wrote: > > > If that was the case, then we would see 100% CPU usage on all CPU cores, > right? Here my question is why only 2.5 cores are saturated while I'm > pinning the benchmark to 4 physical cores. > > Regards > > Antoine. > > > Le 24/02/2019 à 14:29, Francois Saint-Jacques a écrit : > > A quick glance suggests you're limited by the kernel copying memory around > > (https://gist.github.com/fsaintjacques/1fa00c8e50a09325960d8dc7463c497e). > > I think the next step is to use different physical hosts for client and > > server. This > > way you'll free resources for the server. > > > > François > > > > > > On Thu, Feb 21, 2019 at 12:42 PM Antoine Pitrou wrote: > > > >> > >> We're talking about the BCC tools, which are not based on perf: > >> https://github.com/iovisor/bcc/ > >> > >> Apparently, using Linux perf for the same purpose is some kind of hassle > >> (you need to write perf scripts?). > >> > >> Regards > >> > >> Antoine. > >> > >> > >> Le 21/02/2019 à 18:40, Francois Saint-Jacques a écrit : > >>> You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to > >> perf, > >>> it'll help the unwinding. Sometimes it's better than the stack pointer > >>> method since it keep track of inlined functions. > >>> > >>> On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou > >> wrote: > >>> > > Ah, thanks. I'm trying it now. The problem is that it doesn't record > userspace stack traces properly (it probably needs all dependencies to > be recompiled with -fno-omit-frame-pointer :-/). So while I know that a > lot of time is spent waiting for futextes, I don't know if that is for a > legitimate reason... > > Regards > > Antoine. > > > Le 21/02/2019 à 17:52, Hatem Helal a écrit : > > I was thinking of this variant: > > > > http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html > > > > but I must admit that I haven't tried that technique myself. > > > > > > > > On 2/21/19, 4:41 PM, "Antoine Pitrou" wrote: > > > > > > I don't think that's the answer here. The question is not how > > to /visualize/ where time is spent waiting, but how to /measure/ > >> it. > > Normal profiling only tells you where CPU time is spent, not what > >> the > > process is idly waiting for. > > > > Regards > > > > Antoine. > > > > > > On Thu, 21 Feb 2019 16:29:15 + > > Hatem Helal wrote: > > > I like flamegraphs for investigating this sort of problem: > > > > > > https://github.com/brendangregg/FlameGraph > > > > > > There are likely many other techniques for inspecting where time > is being spent but that can at least help narrow down the search space. > > > > > > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" < > fsaintjacq...@gmail.com> wrote: > > > > > > Can you remind us what's the easiest way to get flight > >> working > with grpc? > > > clone + make install doesn't really work out of the box. 
> > > > > > François > > > > > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou < > anto...@python.org> wrote: > > > > > > > > > > > Hello, > > > > > > > > I've been trying to saturate several CPU cores using our > Flight > > > > benchmark (which spawns a server process and attempts to > communicate > > > > with it using multiple clients), but haven't managed to. > > > > > > > > The typical command-line I'm executing is the following: > > > > > > > > $ time taskset -c 1,3,5,7 > ./build/release/arrow-flight-benchmark > > > > -records_per_stream 5000 -num_streams 16 -num_threads > >> 32 > > > > -records_per_batch 12 > > > > > > > > Breakdown: > > > > > > > > - "time": I want to get CPU user / system / wall-clock > >> times > > > > > > > > - "taskset -c ...": I have a 8-core 16-threads machine and > >> I > want to > > > > allow scheduling RPC threads on 4 distinct physical cores > > > > > > > > - "-records_per_stream": I want each stream to have enough > records so > > > > that connection / stream setup costs are negligible > > > > > > > > - "-num_strea
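To make the "different server per thread" experiment above concrete: with the plain gRPC C++ API (independent of the Flight wrapper), one can build several server instances, each bound to its own port, and point each client thread at a different one. The helper below is only an illustration of that experiment; StartServers is a hypothetical name, and Service stands for whatever gRPC service implementation the benchmark registers (for Flight, the generated FlightService implementation).

#include <cstddef>
#include <memory>
#include <string>
#include <vector>

#include <grpcpp/grpcpp.h>

// Sketch: start one gRPC server per port so each client thread can talk to
// its own server instance. This illustrates the experiment, not the Flight API.
template <typename Service>
std::vector<std::unique_ptr<grpc::Server>> StartServers(
    std::vector<Service>& services, int base_port) {
  std::vector<std::unique_ptr<grpc::Server>> servers;
  for (std::size_t i = 0; i < services.size(); ++i) {
    grpc::ServerBuilder builder;
    // Instance i listens on its own port: base_port + i.
    builder.AddListeningPort(
        "127.0.0.1:" + std::to_string(base_port + static_cast<int>(i)),
        grpc::InsecureServerCredentials());
    builder.RegisterService(&services[i]);
    servers.push_back(builder.BuildAndStart());
  }
  // Client thread i would then connect to "127.0.0.1:<base_port + i>"
  // instead of every thread sharing the single hard-coded port.
  return servers;
}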
Re: Flight / gRPC scalability issue
If that was the case, then we would see 100% CPU usage on all CPU cores, right? Here my question is why only 2.5 cores are saturated while I'm pinning the benchmark to 4 physical cores. Regards Antoine. Le 24/02/2019 à 14:29, Francois Saint-Jacques a écrit : > A quick glance suggests you're limited by the kernel copying memory around > (https://gist.github.com/fsaintjacques/1fa00c8e50a09325960d8dc7463c497e). > I think the next step is to use different physical hosts for client and > server. This > way you'll free resources for the server. > > François > > > On Thu, Feb 21, 2019 at 12:42 PM Antoine Pitrou wrote: > >> >> We're talking about the BCC tools, which are not based on perf: >> https://github.com/iovisor/bcc/ >> >> Apparently, using Linux perf for the same purpose is some kind of hassle >> (you need to write perf scripts?). >> >> Regards >> >> Antoine. >> >> >> Le 21/02/2019 à 18:40, Francois Saint-Jacques a écrit : >>> You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to >> perf, >>> it'll help the unwinding. Sometimes it's better than the stack pointer >>> method since it keep track of inlined functions. >>> >>> On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou >> wrote: >>> Ah, thanks. I'm trying it now. The problem is that it doesn't record userspace stack traces properly (it probably needs all dependencies to be recompiled with -fno-omit-frame-pointer :-/). So while I know that a lot of time is spent waiting for futextes, I don't know if that is for a legitimate reason... Regards Antoine. Le 21/02/2019 à 17:52, Hatem Helal a écrit : > I was thinking of this variant: > > http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html > > but I must admit that I haven't tried that technique myself. > > > > On 2/21/19, 4:41 PM, "Antoine Pitrou" wrote: > > > I don't think that's the answer here. The question is not how > to /visualize/ where time is spent waiting, but how to /measure/ >> it. > Normal profiling only tells you where CPU time is spent, not what >> the > process is idly waiting for. > > Regards > > Antoine. > > > On Thu, 21 Feb 2019 16:29:15 + > Hatem Helal wrote: > > I like flamegraphs for investigating this sort of problem: > > > > https://github.com/brendangregg/FlameGraph > > > > There are likely many other techniques for inspecting where time is being spent but that can at least help narrow down the search space. > > > > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" < fsaintjacq...@gmail.com> wrote: > > > > Can you remind us what's the easiest way to get flight >> working with grpc? > > clone + make install doesn't really work out of the box. > > > > François > > > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou < anto...@python.org> wrote: > > > > > > > > Hello, > > > > > > I've been trying to saturate several CPU cores using our Flight > > > benchmark (which spawns a server process and attempts to communicate > > > with it using multiple clients), but haven't managed to. 
> > > > > > The typical command-line I'm executing is the following: > > > > > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark > > > -records_per_stream 5000 -num_streams 16 -num_threads >> 32 > > > -records_per_batch 12 > > > > > > Breakdown: > > > > > > - "time": I want to get CPU user / system / wall-clock >> times > > > > > > - "taskset -c ...": I have a 8-core 16-threads machine and >> I want to > > > allow scheduling RPC threads on 4 distinct physical cores > > > > > > - "-records_per_stream": I want each stream to have enough records so > > > that connection / stream setup costs are negligible > > > > > > - "-num_streams": this is the number of streams the benchmark tries to > > > download (DoGet()) from the server to the client > > > > > > - "-num_threads": this is the number of client threads the benchmark > > > makes download requests from. Since our client is currently > > > blocking, it makes sense to have a large number of client threads (to > > > allow overlap). Note that each thread creates a separate gRPC client > > > and connection. > > > > > > - "-records_per_batch": transfer enough records per individual RPC > > > me
Re: Flight / gRPC scalability issue
A quick glance suggests you're limited by the kernel copying memory around (https://gist.github.com/fsaintjacques/1fa00c8e50a09325960d8dc7463c497e). I think the next step is to use different physical hosts for client and server. This way you'll free resources for the server. François On Thu, Feb 21, 2019 at 12:42 PM Antoine Pitrou wrote: > > We're talking about the BCC tools, which are not based on perf: > https://github.com/iovisor/bcc/ > > Apparently, using Linux perf for the same purpose is some kind of hassle > (you need to write perf scripts?). > > Regards > > Antoine. > > > Le 21/02/2019 à 18:40, Francois Saint-Jacques a écrit : > > You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to > perf, > > it'll help the unwinding. Sometimes it's better than the stack pointer > > method since it keep track of inlined functions. > > > > On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou > wrote: > > > >> > >> Ah, thanks. I'm trying it now. The problem is that it doesn't record > >> userspace stack traces properly (it probably needs all dependencies to > >> be recompiled with -fno-omit-frame-pointer :-/). So while I know that a > >> lot of time is spent waiting for futextes, I don't know if that is for a > >> legitimate reason... > >> > >> Regards > >> > >> Antoine. > >> > >> > >> Le 21/02/2019 à 17:52, Hatem Helal a écrit : > >>> I was thinking of this variant: > >>> > >>> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html > >>> > >>> but I must admit that I haven't tried that technique myself. > >>> > >>> > >>> > >>> On 2/21/19, 4:41 PM, "Antoine Pitrou" wrote: > >>> > >>> > >>> I don't think that's the answer here. The question is not how > >>> to /visualize/ where time is spent waiting, but how to /measure/ > it. > >>> Normal profiling only tells you where CPU time is spent, not what > the > >>> process is idly waiting for. > >>> > >>> Regards > >>> > >>> Antoine. > >>> > >>> > >>> On Thu, 21 Feb 2019 16:29:15 + > >>> Hatem Helal wrote: > >>> > I like flamegraphs for investigating this sort of problem: > >>> > > >>> > https://github.com/brendangregg/FlameGraph > >>> > > >>> > There are likely many other techniques for inspecting where time > >> is being spent but that can at least help narrow down the search space. > >>> > > >>> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" < > >> fsaintjacq...@gmail.com> wrote: > >>> > > >>> > Can you remind us what's the easiest way to get flight > working > >> with grpc? > >>> > clone + make install doesn't really work out of the box. > >>> > > >>> > François > >>> > > >>> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou < > >> anto...@python.org> wrote: > >>> > > >>> > > > >>> > > Hello, > >>> > > > >>> > > I've been trying to saturate several CPU cores using our > >> Flight > >>> > > benchmark (which spawns a server process and attempts to > >> communicate > >>> > > with it using multiple clients), but haven't managed to. 
> >>> > > > >>> > > The typical command-line I'm executing is the following: > >>> > > > >>> > > $ time taskset -c 1,3,5,7 > >> ./build/release/arrow-flight-benchmark > >>> > > -records_per_stream 5000 -num_streams 16 -num_threads > 32 > >>> > > -records_per_batch 12 > >>> > > > >>> > > Breakdown: > >>> > > > >>> > > - "time": I want to get CPU user / system / wall-clock > times > >>> > > > >>> > > - "taskset -c ...": I have a 8-core 16-threads machine and > I > >> want to > >>> > > allow scheduling RPC threads on 4 distinct physical cores > >>> > > > >>> > > - "-records_per_stream": I want each stream to have enough > >> records so > >>> > > that connection / stream setup costs are negligible > >>> > > > >>> > > - "-num_streams": this is the number of streams the > >> benchmark tries to > >>> > > download (DoGet()) from the server to the client > >>> > > > >>> > > - "-num_threads": this is the number of client threads the > >> benchmark > >>> > > makes download requests from. Since our client is > >> currently > >>> > > blocking, it makes sense to have a large number of client > >> threads (to > >>> > > allow overlap). Note that each thread creates a separate > >> gRPC client > >>> > > and connection. > >>> > > > >>> > > - "-records_per_batch": transfer enough records per > >> individual RPC > >>> > > message, to minimize overhead. This number brings us > >> close to the > >>> > > default gRPC message limit of 4 MB. > >>> > > > >>> > > The results I get look like: > >>> > > > >>> > > Bytes read: 256 > >>> > > Nanos: 8433804781 > >>> > > Spee
Re: Flight / gRPC scalability issue
We're talking about the BCC tools, which are not based on perf: https://github.com/iovisor/bcc/ Apparently, using Linux perf for the same purpose is some kind of hassle (you need to write perf scripts?). Regards Antoine. Le 21/02/2019 à 18:40, Francois Saint-Jacques a écrit : > You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to perf, > it'll help the unwinding. Sometimes it's better than the stack pointer > method since it keep track of inlined functions. > > On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou wrote: > >> >> Ah, thanks. I'm trying it now. The problem is that it doesn't record >> userspace stack traces properly (it probably needs all dependencies to >> be recompiled with -fno-omit-frame-pointer :-/). So while I know that a >> lot of time is spent waiting for futextes, I don't know if that is for a >> legitimate reason... >> >> Regards >> >> Antoine. >> >> >> Le 21/02/2019 à 17:52, Hatem Helal a écrit : >>> I was thinking of this variant: >>> >>> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html >>> >>> but I must admit that I haven't tried that technique myself. >>> >>> >>> >>> On 2/21/19, 4:41 PM, "Antoine Pitrou" wrote: >>> >>> >>> I don't think that's the answer here. The question is not how >>> to /visualize/ where time is spent waiting, but how to /measure/ it. >>> Normal profiling only tells you where CPU time is spent, not what the >>> process is idly waiting for. >>> >>> Regards >>> >>> Antoine. >>> >>> >>> On Thu, 21 Feb 2019 16:29:15 + >>> Hatem Helal wrote: >>> > I like flamegraphs for investigating this sort of problem: >>> > >>> > https://github.com/brendangregg/FlameGraph >>> > >>> > There are likely many other techniques for inspecting where time >> is being spent but that can at least help narrow down the search space. >>> > >>> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" < >> fsaintjacq...@gmail.com> wrote: >>> > >>> > Can you remind us what's the easiest way to get flight working >> with grpc? >>> > clone + make install doesn't really work out of the box. >>> > >>> > François >>> > >>> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou < >> anto...@python.org> wrote: >>> > >>> > > >>> > > Hello, >>> > > >>> > > I've been trying to saturate several CPU cores using our >> Flight >>> > > benchmark (which spawns a server process and attempts to >> communicate >>> > > with it using multiple clients), but haven't managed to. >>> > > >>> > > The typical command-line I'm executing is the following: >>> > > >>> > > $ time taskset -c 1,3,5,7 >> ./build/release/arrow-flight-benchmark >>> > > -records_per_stream 5000 -num_streams 16 -num_threads 32 >>> > > -records_per_batch 12 >>> > > >>> > > Breakdown: >>> > > >>> > > - "time": I want to get CPU user / system / wall-clock times >>> > > >>> > > - "taskset -c ...": I have a 8-core 16-threads machine and I >> want to >>> > > allow scheduling RPC threads on 4 distinct physical cores >>> > > >>> > > - "-records_per_stream": I want each stream to have enough >> records so >>> > > that connection / stream setup costs are negligible >>> > > >>> > > - "-num_streams": this is the number of streams the >> benchmark tries to >>> > > download (DoGet()) from the server to the client >>> > > >>> > > - "-num_threads": this is the number of client threads the >> benchmark >>> > > makes download requests from. Since our client is >> currently >>> > > blocking, it makes sense to have a large number of client >> threads (to >>> > > allow overlap). 
Note that each thread creates a separate >> gRPC client >>> > > and connection. >>> > > >>> > > - "-records_per_batch": transfer enough records per >> individual RPC >>> > > message, to minimize overhead. This number brings us >> close to the >>> > > default gRPC message limit of 4 MB. >>> > > >>> > > The results I get look like: >>> > > >>> > > Bytes read: 256 >>> > > Nanos: 8433804781 >>> > > Speed: 2894.79 MB/s >>> > > >>> > > real0m8,569s >>> > > user0m6,085s >>> > > sys 0m15,667s >>> > > >>> > > >>> > > If we divide (user + sys) by real, we conclude that 2.5 >> cores are >>> > > saturated by this benchmark. Evidently, this means that the >> benchmark >>> > > is waiting a *lot*. The question is: where? >>> > > >>> > > Here is some things I looked at: >>> > > >>> > > - mutex usage inside Arrow. None seems to pop up (printf is >> my friend). >>> > > >>> > > -
Re: Flight / gRPC scalability issue
You can compile with DWARF debug info (-g/-ggdb) and pass `--call-graph=dwarf` to perf; it'll help the unwinding. Sometimes it's better than the stack-pointer method since it keeps track of inlined functions. On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou wrote: > > Ah, thanks. I'm trying it now. The problem is that it doesn't record > userspace stack traces properly (it probably needs all dependencies to > be recompiled with -fno-omit-frame-pointer :-/). So while I know that a > lot of time is spent waiting for futexes, I don't know if that is for a > legitimate reason... > > Regards > > Antoine. > > > On 21/02/2019 at 17:52, Hatem Helal wrote: > > I was thinking of this variant: > > > > http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html > > > > but I must admit that I haven't tried that technique myself. > > > > > > > > On 2/21/19, 4:41 PM, "Antoine Pitrou" wrote: > > > > > > I don't think that's the answer here. The question is not how > > to /visualize/ where time is spent waiting, but how to /measure/ it. > > Normal profiling only tells you where CPU time is spent, not what the > > process is idly waiting for. > > > > Regards > > > > Antoine. > > > > > > On Thu, 21 Feb 2019 16:29:15 + > > Hatem Helal wrote: > > > I like flamegraphs for investigating this sort of problem: > > > > > > https://github.com/brendangregg/FlameGraph > > > > > > There are likely many other techniques for inspecting where time > is being spent but that can at least help narrow down the search space. > > > > > > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" < > fsaintjacq...@gmail.com> wrote: > > > > > > Can you remind us what's the easiest way to get flight working > with grpc? > > > clone + make install doesn't really work out of the box. > > > > > > François > > > > > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou < > anto...@python.org> wrote: > > > > > > > > > > > Hello, > > > > > > > > I've been trying to saturate several CPU cores using our > Flight > > > > benchmark (which spawns a server process and attempts to > communicate > > > > with it using multiple clients), but haven't managed to. > > > > > > > > The typical command-line I'm executing is the following: > > > > > > > > $ time taskset -c 1,3,5,7 > ./build/release/arrow-flight-benchmark > > > > -records_per_stream 5000 -num_streams 16 -num_threads 32 > > > > -records_per_batch 12 > > > > > > > > Breakdown: > > > > > > > > - "time": I want to get CPU user / system / wall-clock times > > > > > > > > - "taskset -c ...": I have an 8-core 16-threads machine and I > want to > > > > allow scheduling RPC threads on 4 distinct physical cores > > > > > > > > - "-records_per_stream": I want each stream to have enough > records so > > > > that connection / stream setup costs are negligible > > > > > > > > - "-num_streams": this is the number of streams the > benchmark tries to > > > > download (DoGet()) from the server to the client > > > > > > > > - "-num_threads": this is the number of client threads the > benchmark > > > > makes download requests from. Since our client is > currently > > > > blocking, it makes sense to have a large number of client > threads (to > > > > allow overlap). Note that each thread creates a separate > gRPC client > > > > and connection. > > > > > > > > - "-records_per_batch": transfer enough records per > individual RPC > > > > message, to minimize overhead. This number brings us > close to the > > > > default gRPC message limit of 4 MB. 
> > > > > > > > The results I get look like: > > > > > > > > Bytes read: 256 > > > > Nanos: 8433804781 > > > > Speed: 2894.79 MB/s > > > > > > > > real0m8,569s > > > > user0m6,085s > > > > sys 0m15,667s > > > > > > > > > > > > If we divide (user + sys) by real, we conclude that 2.5 > cores are > > > > saturated by this benchmark. Evidently, this means that the > benchmark > > > > is waiting a *lot*. The question is: where? > > > > > > > > Here is some things I looked at: > > > > > > > > - mutex usage inside Arrow. None seems to pop up (printf is > my friend). > > > > > > > > - number of threads used by the gRPC server. gRPC > implicitly spawns a > > > > number of threads to handle incoming client requests. > I've checked > > > > (using printf...) that several threads are indeed used to > serve > > > > incoming connections. > > > > > > > > - CPU usage b
Re: Flight / gRPC scalability issue
Ah, thanks. I'm trying it now. The problem is that it doesn't record userspace stack traces properly (it probably needs all dependencies to be recompiled with -fno-omit-frame-pointer :-/). So while I know that a lot of time is spent waiting for futexes, I don't know if that is for a legitimate reason... Regards Antoine. On 21/02/2019 at 17:52, Hatem Helal wrote: > I was thinking of this variant: > > http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html > > but I must admit that I haven't tried that technique myself. > > > > On 2/21/19, 4:41 PM, "Antoine Pitrou" wrote: > > > I don't think that's the answer here. The question is not how > to /visualize/ where time is spent waiting, but how to /measure/ it. > Normal profiling only tells you where CPU time is spent, not what the > process is idly waiting for. > > Regards > > Antoine. > > > On Thu, 21 Feb 2019 16:29:15 + > Hatem Helal wrote: > > I like flamegraphs for investigating this sort of problem: > > > > https://github.com/brendangregg/FlameGraph > > > > There are likely many other techniques for inspecting where time is > being spent but that can at least help narrow down the search space. > > > > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" > wrote: > > > > Can you remind us what's the easiest way to get flight working with > grpc? > > clone + make install doesn't really work out of the box. > > > > François > > > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou > wrote: > > > > > > > > Hello, > > > > > > I've been trying to saturate several CPU cores using our Flight > > > benchmark (which spawns a server process and attempts to > communicate > > > with it using multiple clients), but haven't managed to. > > > > > > The typical command-line I'm executing is the following: > > > > > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark > > > -records_per_stream 5000 -num_streams 16 -num_threads 32 > > > -records_per_batch 12 > > > > > > Breakdown: > > > > > > - "time": I want to get CPU user / system / wall-clock times > > > > > > - "taskset -c ...": I have an 8-core 16-threads machine and I want > to > > > allow scheduling RPC threads on 4 distinct physical cores > > > > > > - "-records_per_stream": I want each stream to have enough > records so > > > that connection / stream setup costs are negligible > > > > > > - "-num_streams": this is the number of streams the benchmark > tries to > > > download (DoGet()) from the server to the client > > > > > > - "-num_threads": this is the number of client threads the > benchmark > > > makes download requests from. Since our client is currently > > > blocking, it makes sense to have a large number of client > threads (to > > > allow overlap). Note that each thread creates a separate gRPC > client > > > and connection. > > > > > > - "-records_per_batch": transfer enough records per individual RPC > > > message, to minimize overhead. This number brings us close to > the > > > default gRPC message limit of 4 MB. > > > > > > The results I get look like: > > > > > > Bytes read: 256 > > > Nanos: 8433804781 > > > Speed: 2894.79 MB/s > > > > > > real 0m8,569s > > > user 0m6,085s > > > sys 0m15,667s > > > > > > > > > If we divide (user + sys) by real, we conclude that 2.5 cores are > > > saturated by this benchmark. Evidently, this means that the > benchmark > > > is waiting a *lot*. The question is: where? > > > > > > Here are some things I looked at: > > > > > > - mutex usage inside Arrow. None seems to pop up (printf is my > friend). 
> > > > > > - number of threads used by the gRPC server. gRPC implicitly > spawns a > > > number of threads to handle incoming client requests. I've > checked > > > (using printf...) that several threads are indeed used to serve > > > incoming connections. > > > > > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time > is > > > spent in memcpy() calls in the *client* (precisely, in the > > > grpc_byte_buffer_reader_readall() call inside > > > arrow::flight::internal::FlightDataDeserialize()). It doesn't > look > > > like the server is the bottleneck. > > > > > > - the benchmark connects to "localhost". I've changed it to > > > "127.0.0.1", it doesn't m
Re: Flight / gRPC scalability issue
I was thinking of this variant: http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html but I must admit that I haven't tried that technique myself. On 2/21/19, 4:41 PM, "Antoine Pitrou" wrote: I don't think that's the answer here. The question is not how to /visualize/ where time is spent waiting, but how to /measure/ it. Normal profiling only tells you where CPU time is spent, not what the process is idly waiting for. Regards Antoine. On Thu, 21 Feb 2019 16:29:15 + Hatem Helal wrote: > I like flamegraphs for investigating this sort of problem: > > https://github.com/brendangregg/FlameGraph > > There are likely many other techniques for inspecting where time is being spent but that can at least help narrow down the search space. > > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" wrote: > > Can you remind us what's the easiest way to get flight working with grpc? > clone + make install doesn't really work out of the box. > > François > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou wrote: > > > > > Hello, > > > > I've been trying to saturate several CPU cores using our Flight > > benchmark (which spawns a server process and attempts to communicate > > with it using multiple clients), but haven't managed to. > > > > The typical command-line I'm executing is the following: > > > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark > > -records_per_stream 5000 -num_streams 16 -num_threads 32 > > -records_per_batch 12 > > > > Breakdown: > > > > - "time": I want to get CPU user / system / wall-clock times > > > > - "taskset -c ...": I have a 8-core 16-threads machine and I want to > > allow scheduling RPC threads on 4 distinct physical cores > > > > - "-records_per_stream": I want each stream to have enough records so > > that connection / stream setup costs are negligible > > > > - "-num_streams": this is the number of streams the benchmark tries to > > download (DoGet()) from the server to the client > > > > - "-num_threads": this is the number of client threads the benchmark > > makes download requests from. Since our client is currently > > blocking, it makes sense to have a large number of client threads (to > > allow overlap). Note that each thread creates a separate gRPC client > > and connection. > > > > - "-records_per_batch": transfer enough records per individual RPC > > message, to minimize overhead. This number brings us close to the > > default gRPC message limit of 4 MB. > > > > The results I get look like: > > > > Bytes read: 256 > > Nanos: 8433804781 > > Speed: 2894.79 MB/s > > > > real0m8,569s > > user0m6,085s > > sys 0m15,667s > > > > > > If we divide (user + sys) by real, we conclude that 2.5 cores are > > saturated by this benchmark. Evidently, this means that the benchmark > > is waiting a *lot*. The question is: where? > > > > Here is some things I looked at: > > > > - mutex usage inside Arrow. None seems to pop up (printf is my friend). > > > > - number of threads used by the gRPC server. gRPC implicitly spawns a > > number of threads to handle incoming client requests. I've checked > > (using printf...) that several threads are indeed used to serve > > incoming connections. > > > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time is > > spent in memcpy() calls in the *client* (precisely, in the > > grpc_byte_buffer_reader_readall() call inside > > arrow::flight::internal::FlightDataDeserialize()). It doesn't look > > like the server is the bottleneck. > > > > - the benchmark connects to "localhost". 
I've changed it to > > "127.0.0.1", it doesn't make a difference. AFAIK, localhost TCP > > connections should be well-optimized on Linux. It seems highly > > unlikely that they would incur idle waiting times (rather than CPU > > time processing packets). > > > > - RAM usage. It's quite reasonable at 220 MB (client) + 75 MB > > (server). No swapping occurs. > > > > - Disk I/O. "vmstat" tells me no block I/O happens during the > > benchmark. > > > > - As a reference, I can transfer 5 GB/s over a single TCP connection > > using plain sockets in a simple Python s
Re: Flight / gRPC scalability issue
I don't think that's the answer here. The question is not how to /visualize/ where time is spent waiting, but how to /measure/ it. Normal profiling only tells you where CPU time is spent, not what the process is idly waiting for. Regards Antoine. On Thu, 21 Feb 2019 16:29:15 + Hatem Helal wrote: > I like flamegraphs for investigating this sort of problem: > > https://github.com/brendangregg/FlameGraph > > There are likely many other techniques for inspecting where time is being > spent but that can at least help narrow down the search space. > > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" > wrote: > > Can you remind us what's the easiest way to get flight working with grpc? > clone + make install doesn't really work out of the box. > > François > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou > wrote: > > > > > Hello, > > > > I've been trying to saturate several CPU cores using our Flight > > benchmark (which spawns a server process and attempts to communicate > > with it using multiple clients), but haven't managed to. > > > > The typical command-line I'm executing is the following: > > > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark > > -records_per_stream 5000 -num_streams 16 -num_threads 32 > > -records_per_batch 12 > > > > Breakdown: > > > > - "time": I want to get CPU user / system / wall-clock times > > > > - "taskset -c ...": I have a 8-core 16-threads machine and I want to > > allow scheduling RPC threads on 4 distinct physical cores > > > > - "-records_per_stream": I want each stream to have enough records so > > that connection / stream setup costs are negligible > > > > - "-num_streams": this is the number of streams the benchmark tries to > > download (DoGet()) from the server to the client > > > > - "-num_threads": this is the number of client threads the benchmark > > makes download requests from. Since our client is currently > > blocking, it makes sense to have a large number of client threads (to > > allow overlap). Note that each thread creates a separate gRPC client > > and connection. > > > > - "-records_per_batch": transfer enough records per individual RPC > > message, to minimize overhead. This number brings us close to the > > default gRPC message limit of 4 MB. > > > > The results I get look like: > > > > Bytes read: 256 > > Nanos: 8433804781 > > Speed: 2894.79 MB/s > > > > real0m8,569s > > user0m6,085s > > sys 0m15,667s > > > > > > If we divide (user + sys) by real, we conclude that 2.5 cores are > > saturated by this benchmark. Evidently, this means that the benchmark > > is waiting a *lot*. The question is: where? > > > > Here is some things I looked at: > > > > - mutex usage inside Arrow. None seems to pop up (printf is my friend). > > > > - number of threads used by the gRPC server. gRPC implicitly spawns a > > number of threads to handle incoming client requests. I've checked > > (using printf...) that several threads are indeed used to serve > > incoming connections. > > > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time is > > spent in memcpy() calls in the *client* (precisely, in the > > grpc_byte_buffer_reader_readall() call inside > > arrow::flight::internal::FlightDataDeserialize()). It doesn't look > > like the server is the bottleneck. > > > > - the benchmark connects to "localhost". I've changed it to > > "127.0.0.1", it doesn't make a difference. AFAIK, localhost TCP > > connections should be well-optimized on Linux. 
It seems highly > > unlikely that they would incur idle waiting times (rather than CPU > > time processing packets). > > > > - RAM usage. It's quite reasonable at 220 MB (client) + 75 MB > > (server). No swapping occurs. > > > > - Disk I/O. "vmstat" tells me no block I/O happens during the > > benchmark. > > > > - As a reference, I can transfer 5 GB/s over a single TCP connection > > using plain sockets in a simple Python script. 3 GB/s over multiple > > connections doesn't look terrific. > > > > > > So it looks like there's a scalability issue inside our current Flight > > code, or perhaps inside gRPC. The benchmark itself, if simplistic, > > doesn't look problematic; it should actually be kind of a best case, > > especially with the above parameters. > > > > Does anyone have any clues or ideas? In particular, is there a simple > > way to diagnose *where* exactly the waiting times happen? > > > > Regards > > > > Antoine. > > > >
Re: Flight / gRPC scalability issue
I like flamegraphs for investigating this sort of problem: https://github.com/brendangregg/FlameGraph There are likely many other techniques for inspecting where time is being spent but that can at least help narrow down the search space. On 2/21/19, 4:03 PM, "Francois Saint-Jacques" wrote: Can you remind us what's the easiest way to get flight working with grpc? clone + make install doesn't really work out of the box. François On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou wrote: > > Hello, > > I've been trying to saturate several CPU cores using our Flight > benchmark (which spawns a server process and attempts to communicate > with it using multiple clients), but haven't managed to. > > The typical command-line I'm executing is the following: > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark > -records_per_stream 5000 -num_streams 16 -num_threads 32 > -records_per_batch 12 > > Breakdown: > > - "time": I want to get CPU user / system / wall-clock times > > - "taskset -c ...": I have a 8-core 16-threads machine and I want to > allow scheduling RPC threads on 4 distinct physical cores > > - "-records_per_stream": I want each stream to have enough records so > that connection / stream setup costs are negligible > > - "-num_streams": this is the number of streams the benchmark tries to > download (DoGet()) from the server to the client > > - "-num_threads": this is the number of client threads the benchmark > makes download requests from. Since our client is currently > blocking, it makes sense to have a large number of client threads (to > allow overlap). Note that each thread creates a separate gRPC client > and connection. > > - "-records_per_batch": transfer enough records per individual RPC > message, to minimize overhead. This number brings us close to the > default gRPC message limit of 4 MB. > > The results I get look like: > > Bytes read: 256 > Nanos: 8433804781 > Speed: 2894.79 MB/s > > real0m8,569s > user0m6,085s > sys 0m15,667s > > > If we divide (user + sys) by real, we conclude that 2.5 cores are > saturated by this benchmark. Evidently, this means that the benchmark > is waiting a *lot*. The question is: where? > > Here is some things I looked at: > > - mutex usage inside Arrow. None seems to pop up (printf is my friend). > > - number of threads used by the gRPC server. gRPC implicitly spawns a > number of threads to handle incoming client requests. I've checked > (using printf...) that several threads are indeed used to serve > incoming connections. > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time is > spent in memcpy() calls in the *client* (precisely, in the > grpc_byte_buffer_reader_readall() call inside > arrow::flight::internal::FlightDataDeserialize()). It doesn't look > like the server is the bottleneck. > > - the benchmark connects to "localhost". I've changed it to > "127.0.0.1", it doesn't make a difference. AFAIK, localhost TCP > connections should be well-optimized on Linux. It seems highly > unlikely that they would incur idle waiting times (rather than CPU > time processing packets). > > - RAM usage. It's quite reasonable at 220 MB (client) + 75 MB > (server). No swapping occurs. > > - Disk I/O. "vmstat" tells me no block I/O happens during the > benchmark. > > - As a reference, I can transfer 5 GB/s over a single TCP connection > using plain sockets in a simple Python script. 3 GB/s over multiple > connections doesn't look terrific. 
> > > So it looks like there's a scalability issue inside our current Flight > code, or perhaps inside gRPC. The benchmark itself, if simplistic, > doesn't look problematic; it should actually be kind of a best case, > especially with the above parameters. > > Does anyone have any clues or ideas? In particular, is there a simple > way to diagnose *where* exactly the waiting times happen? > > Regards > > Antoine. >
Re: Flight / gRPC scalability issue
I like flamegraphs for investigating this sort of problem: https://github.com/brendangregg/FlameGraph There are likely many other techniques for inspecting where time is being spent but that can at least help narrow down the search space. On 2/21/19, 4:29 PM, "Wes McKinney" wrote: Hi Francois, It *should* work out of the box. I spent some time to make sure it does. Can you open a JIRA? I recommend using the grpc-cpp conda-forge package. Wes On Thu, Feb 21, 2019, 11:03 AM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > Can you remind us what's the easiest way to get flight working with grpc? > clone + make install doesn't really work out of the box. > > François > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou > wrote: > > > > > Hello, > > > > I've been trying to saturate several CPU cores using our Flight > > benchmark (which spawns a server process and attempts to communicate > > with it using multiple clients), but haven't managed to. > > > > The typical command-line I'm executing is the following: > > > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark > > -records_per_stream 5000 -num_streams 16 -num_threads 32 > > -records_per_batch 12 > > > > Breakdown: > > > > - "time": I want to get CPU user / system / wall-clock times > > > > - "taskset -c ...": I have a 8-core 16-threads machine and I want to > > allow scheduling RPC threads on 4 distinct physical cores > > > > - "-records_per_stream": I want each stream to have enough records so > > that connection / stream setup costs are negligible > > > > - "-num_streams": this is the number of streams the benchmark tries to > > download (DoGet()) from the server to the client > > > > - "-num_threads": this is the number of client threads the benchmark > > makes download requests from. Since our client is currently > > blocking, it makes sense to have a large number of client threads (to > > allow overlap). Note that each thread creates a separate gRPC client > > and connection. > > > > - "-records_per_batch": transfer enough records per individual RPC > > message, to minimize overhead. This number brings us close to the > > default gRPC message limit of 4 MB. > > > > The results I get look like: > > > > Bytes read: 256 > > Nanos: 8433804781 > > Speed: 2894.79 MB/s > > > > real0m8,569s > > user0m6,085s > > sys 0m15,667s > > > > > > If we divide (user + sys) by real, we conclude that 2.5 cores are > > saturated by this benchmark. Evidently, this means that the benchmark > > is waiting a *lot*. The question is: where? > > > > Here is some things I looked at: > > > > - mutex usage inside Arrow. None seems to pop up (printf is my friend). > > > > - number of threads used by the gRPC server. gRPC implicitly spawns a > > number of threads to handle incoming client requests. I've checked > > (using printf...) that several threads are indeed used to serve > > incoming connections. > > > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time is > > spent in memcpy() calls in the *client* (precisely, in the > > grpc_byte_buffer_reader_readall() call inside > > arrow::flight::internal::FlightDataDeserialize()). It doesn't look > > like the server is the bottleneck. > > > > - the benchmark connects to "localhost". I've changed it to > > "127.0.0.1", it doesn't make a difference. AFAIK, localhost TCP > > connections should be well-optimized on Linux. It seems highly > > unlikely that they would incur idle waiting times (rather than CPU > > time processing packets). > > > > - RAM usage. 
It's quite reasonable at 220 MB (client) + 75 MB > > (server). No swapping occurs. > > > > - Disk I/O. "vmstat" tells me no block I/O happens during the > > benchmark. > > > > - As a reference, I can transfer 5 GB/s over a single TCP connection > > using plain sockets in a simple Python script. 3 GB/s over multiple > > connections doesn't look terrific. > > > > > > So it looks like there's a scalability issue inside our current Flight > > code, or perhaps inside gRPC. The benchmark itself, if simplistic, > > doesn't look problematic; it should actually be kind of a best case, > > especially with the above parameters. > > > > Does anyone have any clues or ideas? In particular, is there a simple > > way to diagnose *where* exactly the waiting times happen? > > > > Regards > > > > Antoine. > > >
Re: Flight / gRPC scalability issue
Hi Francois, It *should* work out of the box. I spent some time to make sure it does. Can you open a JIRA? I recommend using the grpc-cpp conda-forge package. Wes On Thu, Feb 21, 2019, 11:03 AM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > Can you remind us what's the easiest way to get flight working with grpc? > clone + make install doesn't really work out of the box. > > François > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou > wrote: > > > > > Hello, > > > > I've been trying to saturate several CPU cores using our Flight > > benchmark (which spawns a server process and attempts to communicate > > with it using multiple clients), but haven't managed to. > > > > The typical command-line I'm executing is the following: > > > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark > > -records_per_stream 5000 -num_streams 16 -num_threads 32 > > -records_per_batch 12 > > > > Breakdown: > > > > - "time": I want to get CPU user / system / wall-clock times > > > > - "taskset -c ...": I have a 8-core 16-threads machine and I want to > > allow scheduling RPC threads on 4 distinct physical cores > > > > - "-records_per_stream": I want each stream to have enough records so > > that connection / stream setup costs are negligible > > > > - "-num_streams": this is the number of streams the benchmark tries to > > download (DoGet()) from the server to the client > > > > - "-num_threads": this is the number of client threads the benchmark > > makes download requests from. Since our client is currently > > blocking, it makes sense to have a large number of client threads (to > > allow overlap). Note that each thread creates a separate gRPC client > > and connection. > > > > - "-records_per_batch": transfer enough records per individual RPC > > message, to minimize overhead. This number brings us close to the > > default gRPC message limit of 4 MB. > > > > The results I get look like: > > > > Bytes read: 256 > > Nanos: 8433804781 > > Speed: 2894.79 MB/s > > > > real0m8,569s > > user0m6,085s > > sys 0m15,667s > > > > > > If we divide (user + sys) by real, we conclude that 2.5 cores are > > saturated by this benchmark. Evidently, this means that the benchmark > > is waiting a *lot*. The question is: where? > > > > Here is some things I looked at: > > > > - mutex usage inside Arrow. None seems to pop up (printf is my friend). > > > > - number of threads used by the gRPC server. gRPC implicitly spawns a > > number of threads to handle incoming client requests. I've checked > > (using printf...) that several threads are indeed used to serve > > incoming connections. > > > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time is > > spent in memcpy() calls in the *client* (precisely, in the > > grpc_byte_buffer_reader_readall() call inside > > arrow::flight::internal::FlightDataDeserialize()). It doesn't look > > like the server is the bottleneck. > > > > - the benchmark connects to "localhost". I've changed it to > > "127.0.0.1", it doesn't make a difference. AFAIK, localhost TCP > > connections should be well-optimized on Linux. It seems highly > > unlikely that they would incur idle waiting times (rather than CPU > > time processing packets). > > > > - RAM usage. It's quite reasonable at 220 MB (client) + 75 MB > > (server). No swapping occurs. > > > > - Disk I/O. "vmstat" tells me no block I/O happens during the > > benchmark. > > > > - As a reference, I can transfer 5 GB/s over a single TCP connection > > using plain sockets in a simple Python script. 
3 GB/s over multiple > > connections doesn't look terrific. > > > > > > So it looks like there's a scalability issue inside our current Flight > > code, or perhaps inside gRPC. The benchmark itself, if simplistic, > > doesn't look problematic; it should actually be kind of a best case, > > especially with the above parameters. > > > > Does anyone have any clues or ideas? In particular, is there a simple > > way to diagnose *where* exactly the waiting times happen? > > > > Regards > > > > Antoine. > > >
Re: Flight / gRPC scalability issue
On Thu, 21 Feb 2019 11:02:58 -0500 Francois Saint-Jacques wrote: > Can you remind us what's the easiest way to get flight working with grpc? > clone + make install doesn't really work out of the box. You can install the "grpc-cpp" package from conda-forge. Our CMake configuration should pick it up automatically. Regards Antoine. > > François > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou wrote: > > > > > Hello, > > > > I've been trying to saturate several CPU cores using our Flight > > benchmark (which spawns a server process and attempts to communicate > > with it using multiple clients), but haven't managed to. > > > > The typical command-line I'm executing is the following: > > > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark > > -records_per_stream 5000 -num_streams 16 -num_threads 32 > > -records_per_batch 12 > > > > Breakdown: > > > > - "time": I want to get CPU user / system / wall-clock times > > > > - "taskset -c ...": I have a 8-core 16-threads machine and I want to > > allow scheduling RPC threads on 4 distinct physical cores > > > > - "-records_per_stream": I want each stream to have enough records so > > that connection / stream setup costs are negligible > > > > - "-num_streams": this is the number of streams the benchmark tries to > > download (DoGet()) from the server to the client > > > > - "-num_threads": this is the number of client threads the benchmark > > makes download requests from. Since our client is currently > > blocking, it makes sense to have a large number of client threads (to > > allow overlap). Note that each thread creates a separate gRPC client > > and connection. > > > > - "-records_per_batch": transfer enough records per individual RPC > > message, to minimize overhead. This number brings us close to the > > default gRPC message limit of 4 MB. > > > > The results I get look like: > > > > Bytes read: 256 > > Nanos: 8433804781 > > Speed: 2894.79 MB/s > > > > real0m8,569s > > user0m6,085s > > sys 0m15,667s > > > > > > If we divide (user + sys) by real, we conclude that 2.5 cores are > > saturated by this benchmark. Evidently, this means that the benchmark > > is waiting a *lot*. The question is: where? > > > > Here is some things I looked at: > > > > - mutex usage inside Arrow. None seems to pop up (printf is my friend). > > > > - number of threads used by the gRPC server. gRPC implicitly spawns a > > number of threads to handle incoming client requests. I've checked > > (using printf...) that several threads are indeed used to serve > > incoming connections. > > > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time is > > spent in memcpy() calls in the *client* (precisely, in the > > grpc_byte_buffer_reader_readall() call inside > > arrow::flight::internal::FlightDataDeserialize()). It doesn't look > > like the server is the bottleneck. > > > > - the benchmark connects to "localhost". I've changed it to > > "127.0.0.1", it doesn't make a difference. AFAIK, localhost TCP > > connections should be well-optimized on Linux. It seems highly > > unlikely that they would incur idle waiting times (rather than CPU > > time processing packets). > > > > - RAM usage. It's quite reasonable at 220 MB (client) + 75 MB > > (server). No swapping occurs. > > > > - Disk I/O. "vmstat" tells me no block I/O happens during the > > benchmark. > > > > - As a reference, I can transfer 5 GB/s over a single TCP connection > > using plain sockets in a simple Python script. 3 GB/s over multiple > > connections doesn't look terrific. 
> > > > > > So it looks like there's a scalability issue inside our current Flight > > code, or perhaps inside gRPC. The benchmark itself, if simplistic, > > doesn't look problematic; it should actually be kind of a best case, > > especially with the above parameters. > > > > Does anyone have any clues or ideas? In particular, is there a simple > > way to diagnose *where* exactly the waiting times happen? > > > > Regards > > > > Antoine. > > >
Re: Flight / gRPC scalability issue
Can you remind us what's the easiest way to get flight working with grpc? clone + make install doesn't really work out of the box. François On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou wrote: > > Hello, > > I've been trying to saturate several CPU cores using our Flight > benchmark (which spawns a server process and attempts to communicate > with it using multiple clients), but haven't managed to. > > The typical command-line I'm executing is the following: > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark > -records_per_stream 5000 -num_streams 16 -num_threads 32 > -records_per_batch 12 > > Breakdown: > > - "time": I want to get CPU user / system / wall-clock times > > - "taskset -c ...": I have a 8-core 16-threads machine and I want to > allow scheduling RPC threads on 4 distinct physical cores > > - "-records_per_stream": I want each stream to have enough records so > that connection / stream setup costs are negligible > > - "-num_streams": this is the number of streams the benchmark tries to > download (DoGet()) from the server to the client > > - "-num_threads": this is the number of client threads the benchmark > makes download requests from. Since our client is currently > blocking, it makes sense to have a large number of client threads (to > allow overlap). Note that each thread creates a separate gRPC client > and connection. > > - "-records_per_batch": transfer enough records per individual RPC > message, to minimize overhead. This number brings us close to the > default gRPC message limit of 4 MB. > > The results I get look like: > > Bytes read: 256 > Nanos: 8433804781 > Speed: 2894.79 MB/s > > real0m8,569s > user0m6,085s > sys 0m15,667s > > > If we divide (user + sys) by real, we conclude that 2.5 cores are > saturated by this benchmark. Evidently, this means that the benchmark > is waiting a *lot*. The question is: where? > > Here is some things I looked at: > > - mutex usage inside Arrow. None seems to pop up (printf is my friend). > > - number of threads used by the gRPC server. gRPC implicitly spawns a > number of threads to handle incoming client requests. I've checked > (using printf...) that several threads are indeed used to serve > incoming connections. > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time is > spent in memcpy() calls in the *client* (precisely, in the > grpc_byte_buffer_reader_readall() call inside > arrow::flight::internal::FlightDataDeserialize()). It doesn't look > like the server is the bottleneck. > > - the benchmark connects to "localhost". I've changed it to > "127.0.0.1", it doesn't make a difference. AFAIK, localhost TCP > connections should be well-optimized on Linux. It seems highly > unlikely that they would incur idle waiting times (rather than CPU > time processing packets). > > - RAM usage. It's quite reasonable at 220 MB (client) + 75 MB > (server). No swapping occurs. > > - Disk I/O. "vmstat" tells me no block I/O happens during the > benchmark. > > - As a reference, I can transfer 5 GB/s over a single TCP connection > using plain sockets in a simple Python script. 3 GB/s over multiple > connections doesn't look terrific. > > > So it looks like there's a scalability issue inside our current Flight > code, or perhaps inside gRPC. The benchmark itself, if simplistic, > doesn't look problematic; it should actually be kind of a best case, > especially with the above parameters. > > Does anyone have any clues or ideas? 
In particular, is there a simple > way to diagnose *where* exactly the waiting times happen? > > Regards > > Antoine. >
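Regarding the memcpy() hotspot mentioned above: grpc_byte_buffer_reader_readall() concatenates all HTTP/2 slices of an incoming message into one newly allocated contiguous slice, which is where the client-side copying shows up. A deserializer could instead walk the slices with the gRPC C core reader API, as in the sketch below. This is only an illustration of the idea; ReadMessageSlices is a hypothetical helper, not the actual FlightDataDeserialize() code, and it still copies each slice (a real implementation might hand slices to the IPC reader without copying at all).

#include <cstdint>
#include <vector>

#include <grpc/byte_buffer.h>
#include <grpc/byte_buffer_reader.h>
#include <grpc/slice.h>

// Sketch: iterate the slices of an incoming gRPC message instead of calling
// grpc_byte_buffer_reader_readall(), which allocates one big slice and
// memcpy()s every input slice into it (the hotspot seen in the profile).
// Error handling is minimal; this is not the Flight deserialization code.
bool ReadMessageSlices(grpc_byte_buffer* buffer, std::vector<uint8_t>* out) {
  grpc_byte_buffer_reader reader;
  if (!grpc_byte_buffer_reader_init(&reader, buffer)) {
    return false;
  }
  grpc_slice slice;
  while (grpc_byte_buffer_reader_next(&reader, &slice)) {
    const uint8_t* data = GRPC_SLICE_START_PTR(slice);
    const size_t length = GRPC_SLICE_LENGTH(slice);
    out->insert(out->end(), data, data + length);  // per-slice copy only
    grpc_slice_unref(slice);
  }
  grpc_byte_buffer_reader_destroy(&reader);
  return true;
}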