hi Antoine,

All of the Flight traffic is going through a single, hard-coded port:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight-benchmark.cc#L185

What happens if you spin up a different server (and use a different
port) for each thread? I'm surprised no one else has mentioned this
yet.

https://issues.apache.org/jira/browse/ARROW-3330
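
Rough, untested sketch of what I mean below -- StartPerfServer and
RunClientThread are hypothetical stand-ins for whatever the benchmark
already does to spawn the server subprocess and drive a client
thread's DoGet() loop:

// Sketch only: partition the benchmark's client threads across several
// server processes, each listening on its own port, instead of pointing
// every thread at one hard-coded port.
#include <thread>
#include <vector>

// Hypothetical placeholders for the benchmark's existing logic:
void StartPerfServer(int port);   // spawn one server subprocess on `port`
void RunClientThread(int port);   // the existing per-thread DoGet() loop

void RunBenchmark(int num_servers, int num_threads) {
  const int base_port = 31337;  // or whatever base port the benchmark picks
  // One server (sub)process per port instead of a single shared one.
  for (int i = 0; i < num_servers; ++i) {
    StartPerfServer(base_port + i);
  }
  // Round-robin the client threads over the servers so each port only
  // sees num_threads / num_servers connections.
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    const int port = base_port + (t % num_servers);
    threads.emplace_back([port] { RunClientThread(port); });
  }
  for (auto& thread : threads) {
    thread.join();
  }
}

If throughput scales with the number of ports, that would point to a
per-port / per-server bottleneck; if not, the limit is probably on the
client side.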

- Wes

On Sun, Feb 24, 2019 at 9:20 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> If that was the case, then we would see 100% CPU usage on all CPU cores,
> right?  Here my question is why only 2.5 cores are saturated while I'm
> pinning the benchmark to 4 physical cores.
>
> Regards
>
> Antoine.
>
>
> On 24/02/2019 at 14:29, Francois Saint-Jacques wrote:
> > A quick glance suggests you're limited by the kernel copying memory around
> > (https://gist.github.com/fsaintjacques/1fa00c8e50a09325960d8dc7463c497e).
> > I think the next step is to use different physical hosts for client and
> > server. This way you'll free resources for the server.
> >
> > François
> >
> >
> > On Thu, Feb 21, 2019 at 12:42 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> We're talking about the BCC tools, which are not based on perf:
> >> https://github.com/iovisor/bcc/
> >>
> >> Apparently, using Linux perf for the same purpose is some kind of hassle
> >> (you need to write perf scripts?).
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 21/02/2019 at 18:40, Francois Saint-Jacques wrote:
> >>> You can compile with DWARF debug info (-g/-ggdb) and pass
> >>> `--call-graph=dwarf` to perf; it'll help the unwinding. Sometimes it's
> >>> better than the frame-pointer method since it keeps track of inlined
> >>> functions.
> >>>
> >>> On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou <anto...@python.org> wrote:
> >>>
> >>>>
> >>>> Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
> >>>> userspace stack traces properly (it probably needs all dependencies to
> >>>> be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
> >>>> lot of time is spent waiting for futexes, I don't know if that is for a
> >>>> legitimate reason...
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>> On 21/02/2019 at 17:52, Hatem Helal wrote:
> >>>>> I was thinking of this variant:
> >>>>>
> >>>>> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
> >>>>>
> >>>>> but I must admit that I haven't tried that technique myself.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/21/19, 4:41 PM, "Antoine Pitrou" <solip...@pitrou.net> wrote:
> >>>>>
> >>>>>
> >>>>>     I don't think that's the answer here.  The question is not how
> >>>>>     to /visualize/ where time is spent waiting, but how to /measure/ it.
> >>>>>     Normal profiling only tells you where CPU time is spent, not what
> >>>>>     the process is idly waiting for.
> >>>>>
> >>>>>     Regards
> >>>>>
> >>>>>     Antoine.
> >>>>>
> >>>>>
> >>>>>     On Thu, 21 Feb 2019 16:29:15 +0000
> >>>>>     Hatem Helal <hhe...@mathworks.com> wrote:
> >>>>>     > I like flamegraphs for investigating this sort of problem:
> >>>>>     >
> >>>>>     > https://github.com/brendangregg/FlameGraph
> >>>>>     >
> >>>>>     > There are likely many other techniques for inspecting where time
> >>>>>     > is being spent, but that can at least help narrow down the search
> >>>>>     > space.
> >>>>>     >
> >>>>>     > On 2/21/19, 4:03 PM, "Francois Saint-Jacques"
> >>>>>     > <fsaintjacq...@gmail.com> wrote:
> >>>>>     >
> >>>>>     >     Can you remind us what's the easiest way to get Flight working
> >>>>>     >     with gRPC? clone + make install doesn't really work out of the
> >>>>>     >     box.
> >>>>>     >
> >>>>>     >     François
> >>>>>     >
> >>>>>     >     On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou
> >>>>>     >     <anto...@python.org> wrote:
> >>>>>     >
> >>>>>     >     >
> >>>>>     >     > Hello,
> >>>>>     >     >
> >>>>>     >     > I've been trying to saturate several CPU cores using our
> >>>>>     >     > Flight benchmark (which spawns a server process and attempts
> >>>>>     >     > to communicate with it using multiple clients), but haven't
> >>>>>     >     > managed to.
> >>>>>     >     >
> >>>>>     >     > The typical command line I'm executing is the following:
> >>>>>     >     >
> >>>>>     >     > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark \
> >>>>>     >     >     -records_per_stream 50000000 -num_streams 16 \
> >>>>>     >     >     -num_threads 32 -records_per_batch 120000
> >>>>>     >     >
> >>>>>     >     > Breakdown:
> >>>>>     >     >
> >>>>>     >     > - "time": I want to get CPU user / system / wall-clock times
> >>>>>     >     >
> >>>>>     >     > - "taskset -c ...": I have an 8-core 16-thread machine and I
> >>>>>     >     >   want to allow scheduling RPC threads on 4 distinct physical
> >>>>>     >     >   cores
> >>>>>     >     >
> >>>>>     >     > - "-records_per_stream": I want each stream to have enough
> >>>>>     >     >   records so that connection / stream setup costs are negligible
> >>>>>     >     >
> >>>>>     >     > - "-num_streams": this is the number of streams the benchmark
> >>>>>     >     >   tries to download (DoGet()) from the server to the client
> >>>>>     >     >
> >>>>>     >     > - "-num_threads": this is the number of client threads the
> >>>>>     >     >   benchmark makes download requests from.  Since our client is
> >>>>>     >     >   currently blocking, it makes sense to have a large number of
> >>>>>     >     >   client threads (to allow overlap).  Note that each thread
> >>>>>     >     >   creates a separate gRPC client and connection.
> >>>>>     >     >
> >>>>>     >     > - "-records_per_batch": transfer enough records per individual
> >>>>>     >     >   RPC message, to minimize overhead.  This number brings us
> >>>>>     >     >   close to the default gRPC message limit of 4 MB.
> >>>>>     >     >
> >>>>>     >     > The results I get look like:
> >>>>>     >     >
> >>>>>     >     > Bytes read: 25600000000
> >>>>>     >     > Nanos: 8433804781
> >>>>>     >     > Speed: 2894.79 MB/s
> >>>>>     >     >
> >>>>>     >     > real    0m8,569s
> >>>>>     >     > user    0m6,085s
> >>>>>     >     > sys     0m15,667s
> >>>>>     >     >
> >>>>>     >     >
> >>>>>     >     > If we divide (user + sys) by real, i.e. (6.085 + 15.667) /
> >>>>>     >     > 8.569 ≈ 2.5, we conclude that about 2.5 cores are saturated
> >>>>>     >     > by this benchmark.  Evidently, this means that the benchmark
> >>>>>     >     > is waiting a *lot*.  The question is: where?
> >>>>>     >     >
> >>>>>     >     > Here are some things I looked at:
> >>>>>     >     >
> >>>>>     >     > - mutex usage inside Arrow.  None seems to pop up (printf is
> >>>>>     >     >   my friend).
> >>>>>     >     >
> >>>>>     >     > - number of threads used by the gRPC server.  gRPC implicitly
> >>>>>     >     >   spawns a number of threads to handle incoming client requests.
> >>>>>     >     >   I've checked (using printf...) that several threads are indeed
> >>>>>     >     >   used to serve incoming connections.
> >>>>>     >     >
> >>>>>     >     > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time
> >>>>>     >     >   is spent in memcpy() calls in the *client* (precisely, in the
> >>>>>     >     >   grpc_byte_buffer_reader_readall() call inside
> >>>>>     >     >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't
> >>>>>     >     >   look like the server is the bottleneck.
> >>>>>     >     >
> >>>>>     >     > - the benchmark connects to "localhost".  I've changed it to
> >>>>>     >     >   "127.0.0.1"; it doesn't make a difference.  AFAIK, localhost
> >>>>>     >     >   TCP connections should be well-optimized on Linux.  It seems
> >>>>>     >     >   highly unlikely that they would incur idle waiting times
> >>>>>     >     >   (rather than CPU time processing packets).
> >>>>>     >     >
> >>>>>     >     > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >>>>>     >     >   (server).  No swapping occurs.
> >>>>>     >     >
> >>>>>     >     > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >>>>>     >     >   benchmark.
> >>>>>     >     >
> >>>>>     >     > - As a reference, I can transfer 5 GB/s over a single TCP
> >>>>>     >     >   connection using plain sockets in a simple Python script.
> >>>>>     >     >   3 GB/s over multiple connections doesn't look terrific.
> >>>>>     >     >
> >>>>>     >     >
> >>>>>     >     > So it looks like there's a scalability issue inside our current
> >>>>>     >     > Flight code, or perhaps inside gRPC.  The benchmark itself, if
> >>>>>     >     > simplistic, doesn't look problematic; it should actually be kind
> >>>>>     >     > of a best case, especially with the above parameters.
> >>>>>     >     >
> >>>>>     >     > Does anyone have any clues or ideas?  In particular, is there a
> >>>>>     >     > simple way to diagnose *where* exactly the waiting times happen?
> >>>>>     >     >
> >>>>>     >     > Regards
> >>>>>     >     >
> >>>>>     >     > Antoine.
> >>>>>     >     >
> >>>>>     >
> >>>>>     >
> >>>>>
> >>>>
> >>>
> >>
> >
