Re: Flight / gRPC scalability issue

2019-02-27 Thread Wes McKinney
It seems like this discussion would be relevant to the gRPC community.
There are probably other issues at play, like ensuring that multiple
streams through the same port do not block each other too much: if one
stream has messages of smaller size and another of larger size, then
the byte slices sent are broken up into smaller pieces. We may want to
make some improvements to gRPC, or to how we configure it, to better
suit our performance requirements.
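As a rough sketch of the kind of gRPC knobs in question, here is what the
relevant channel arguments look like through the Python gRPC API. The option
names are standard gRPC channel arguments, but the values, the port, and the
thread-pool size are placeholders for illustration, not what Flight actually
uses:

    from concurrent import futures
    import grpc

    # Standard gRPC channel arguments; -1 lifts the default 4 MB message cap
    # mentioned in the benchmark description further down the thread.
    options = [
        ("grpc.max_send_message_length", -1),
        ("grpc.max_receive_message_length", -1),
    ]

    # Server side: the thread-pool size and the channel arguments are tunable.
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=16),
                         options=options)
    server.add_insecure_port("[::]:31337")          # placeholder port
    server.start()

    # Client side: each client connection can pass the same kind of options.
    channel = grpc.insecure_channel("localhost:31337", options=options)

The C++ API exposes the same channel arguments via grpc::ChannelArguments.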

On Sun, Feb 24, 2019 at 1:19 PM Antoine Pitrou  wrote:
>
>
> > On 24/02/2019 at 19:46, Wes McKinney wrote:
> > OK, I don't know enough about sockets or networking to know what
> > hypothetical performance is possible with 16 concurrent packet streams
> > going through a single port (was the 5GB/s based on a single-threaded
> > or multithreaded benchmark? i.e. did it simulate the equivalent
> > number / size / concurrency of packets that the Flight benchmark is
> > doing).
>
> The 5 GB/s figure uses just two threads: one server thread, one client thread.
> The code is almost trivial; it's just careful to avoid spurious copies:
> https://gist.github.com/pitrou/86fd433a8f71b0052e29fddcf4d766be
>
> Regards
>
> Antoine.


Re: Flight / gRPC scalability issue

2019-02-24 Thread Antoine Pitrou


On 24/02/2019 at 19:46, Wes McKinney wrote:
> OK, I don't know enough about sockets or networking to know what
> hypothetical performance is possible with 16 concurrent packet streams
> going through a single port (was the 5GB/s based on a single-threaded
> or multithreaded benchmark? i.e. did it simulate the equivalent
> number / size / concurrency of packets that the Flight benchmark is
> doing).

The 5 GB/s figure uses just two threads: one server thread, one client thread.
The code is almost trivial; it's just careful to avoid spurious copies:
https://gist.github.com/pitrou/86fd433a8f71b0052e29fddcf4d766be

Regards

Antoine.
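
For a rough idea of what such a two-thread, plain-socket baseline looks like,
here is a minimal sketch; the linked gist is the authoritative version, and
this sketch makes its own assumptions about the port, buffer size, and total
volume transferred:

    import socket
    import threading
    import time

    PORT = 31338          # arbitrary placeholder
    CHUNK = 1 << 20       # 1 MiB per send
    TOTAL = 2 << 30       # 2 GiB overall

    def server():
        # One server thread streaming TOTAL bytes to a single client.
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        payload = b"x" * CHUNK
        sent = 0
        while sent < TOTAL:
            conn.sendall(payload)
            sent += CHUNK
        conn.close()
        srv.close()

    threading.Thread(target=server, daemon=True).start()
    time.sleep(0.2)  # crude wait for the listener to come up

    # One client thread reading everything back and timing the transfer.
    cli = socket.create_connection(("127.0.0.1", PORT))
    start = time.monotonic()
    received = 0
    while received < TOTAL:
        buf = cli.recv(CHUNK)
        if not buf:
            break
        received += len(buf)
    elapsed = time.monotonic() - start
    print("%.2f GB/s" % (received / elapsed / 1e9))

The numbers such a script reports are sensitive to buffer sizes and to how
carefully copies are avoided, which is exactly what the gist is careful about.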


Re: Flight / gRPC scalability issue

2019-02-24 Thread Wes McKinney
OK, I don't know enough about sockets or networking to know what
hypothetical performance is possible with 16 concurrent packet streams
going through a single port (was the 5 GB/s based on a single-threaded
or multithreaded benchmark? i.e. did it simulate the equivalent
number / size / concurrency of packets that the Flight benchmark is
doing?). If the CPU cores aren't being saturated then I guess I/O is
blocking in some way. It might be best to involve folks from the gRPC
community who are more expert in this domain.

To me, > 20 Gbit/s (the benchmark's 2894.79 MB/s works out to roughly
23 Gbit/s) seems like acceptable throughput, considering that
networking faster than 10 gigabit is relatively exotic. I don't think
that optimizing for networks faster than 10 Gbit was even a short-term
goal for Flight. For faster networks I would guess we're going to be
getting into RDMA for moving IPC payloads rather than going through
TCP.

On Sun, Feb 24, 2019 at 12:23 PM Antoine Pitrou  wrote:
>
>
> > On 24/02/2019 at 18:35, Wes McKinney wrote:
> > hi Antoine,
> >
> > All of the Flight traffic is going through a hard-coded single port
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight-benchmark.cc#L185
> >
> > What happens if you spin up a different server (and use a different
> > port) for each thread? I'm surprised no one else has mentioned this
> > yet
>
> Well that's not the expected usage model for a server, either :-)  If
> you run an HTTP server, for example, you don't expect to have to open
> different ports on the same machine (rather than only port 80 or 443) to
> get good scalability.
>
> Regards
>
> Antoine.


Re: Flight / gRPC scalability issue

2019-02-24 Thread Antoine Pitrou


On 24/02/2019 at 18:35, Wes McKinney wrote:
> hi Antoine,
> 
> All of the Flight traffic is going through a hard-coded single port
> 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight-benchmark.cc#L185
> 
> What happens if you spin up a different server (and use a different
> port) for each thread? I'm surprised no one else has mentioned this
> yet

Well that's not the expected usage model for a server, either :-)  If
you run an HTTP server, for example, you don't expect to have to open
different ports on the same machine (rather than only port 80 or 443) to
get good scalability.

Regards

Antoine.


Re: Flight / gRPC scalability issue

2019-02-24 Thread Wes McKinney
hi Antoine,

All of the Flight traffic is going through a hard-coded single port

https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight-benchmark.cc#L185

What happens if you spin up a different server (and use a different
port) for each thread? I'm surprised no one else has mentioned this
yet.

https://issues.apache.org/jira/browse/ARROW-3330

- Wes
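
As an illustration of the shape of the experiment suggested above, here is a
minimal sketch using plain sockets rather than Flight itself (the benchmark's
port is hard-coded, so this only mimics the structure): one listener per port,
one client thread per port, aggregate throughput at the end. The ports, sizes,
and counts are placeholders, and CPython's GIL makes this an imperfect
stand-in for the multithreaded C++ client:

    import socket
    import threading
    import time

    BASE_PORT = 31400       # placeholder
    NUM_PORTS = 4
    CHUNK = 1 << 20         # 1 MiB per send
    PER_STREAM = 512 << 20  # 512 MiB per stream

    def serve(port):
        # One listening socket per port, streaming PER_STREAM bytes.
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        conn, _ = srv.accept()
        payload = b"x" * CHUNK
        sent = 0
        while sent < PER_STREAM:
            conn.sendall(payload)
            sent += CHUNK
        conn.close()
        srv.close()

    def fetch(port, results, idx):
        # One client thread per port, counting the bytes it receives.
        cli = socket.create_connection(("127.0.0.1", port))
        received = 0
        while received < PER_STREAM:
            buf = cli.recv(CHUNK)
            if not buf:
                break
            received += len(buf)
        results[idx] = received

    for i in range(NUM_PORTS):
        threading.Thread(target=serve, args=(BASE_PORT + i,), daemon=True).start()
    time.sleep(0.2)  # crude wait for the listeners to come up

    results = [0] * NUM_PORTS
    clients = [threading.Thread(target=fetch, args=(BASE_PORT + i, results, i))
               for i in range(NUM_PORTS)]
    start = time.monotonic()
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    elapsed = time.monotonic() - start
    print("aggregate: %.2f GB/s" % (sum(results) / elapsed / 1e9))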

On Sun, Feb 24, 2019 at 9:20 AM Antoine Pitrou  wrote:
>
>
> If that was the case, then we would see 100% CPU usage on all CPU cores,
> right?  Here my question is why only 2.5 cores are saturated while I'm
> pinning the benchmark to 4 physical cores.
>
> Regards
>
> Antoine.
>
>
> > On 24/02/2019 at 14:29, Francois Saint-Jacques wrote:
> > A quick glance suggests you're limited by the kernel copying memory around
> > (https://gist.github.com/fsaintjacques/1fa00c8e50a09325960d8dc7463c497e).
> > I think the next step is to use different physical hosts for client and
> > server. This
> > way you'll free resources for the server.
> >
> > François
> >
> >
> > On Thu, Feb 21, 2019 at 12:42 PM Antoine Pitrou  wrote:
> >
> >>
> >> We're talking about the BCC tools, which are not based on perf:
> >> https://github.com/iovisor/bcc/
> >>
> >> Apparently, using Linux perf for the same purpose is some kind of hassle
> >> (you need to write perf scripts?).
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 21/02/2019 at 18:40, Francois Saint-Jacques wrote:
> >>> You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to
> >> perf,
> >>> it'll help the unwinding. Sometimes it's better than the stack pointer
> >>> method since it keeps track of inlined functions.
> >>>
> >>> On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou 
> >> wrote:
> >>>
> 
>  Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
>  userspace stack traces properly (it probably needs all dependencies to
>  be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
 lot of time is spent waiting for futexes, I don't know if that is for a
>  legitimate reason...
> 
>  Regards
> 
>  Antoine.
> 
> 
 On 21/02/2019 at 17:52, Hatem Helal wrote:
> > I was thinking of this variant:
> >
> > http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
> >
> > but I must admit that I haven't tried that technique myself.
> >
> >
> >
> > On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:
> >
> >
> > I don't think that's the answer here.  The question is not how
> > to /visualize/ where time is spent waiting, but how to /measure/
> >> it.
> > Normal profiling only tells you where CPU time is spent, not what
> >> the
> > process is idly waiting for.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 21 Feb 2019 16:29:15 +
> > Hatem Helal  wrote:
> > > I like flamegraphs for investigating this sort of problem:
> > >
> > > https://github.com/brendangregg/FlameGraph
> > >
> > > There are likely many other techniques for inspecting where time
>  is being spent but that can at least help narrow down the search space.
> > >
> > > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" <
>  fsaintjacq...@gmail.com> wrote:
> > >
> > > Can you remind us what's the easiest way to get flight
> >> working
>  with grpc?
> > > clone + make install doesn't really work out of the box.
> > >
> > > François
> > >
> > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou <
>  anto...@python.org> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > I've been trying to saturate several CPU cores using our
>  Flight
> > > > benchmark (which spawns a server process and attempts to
>  communicate
> > > > with it using multiple clients), but haven't managed to.
> > > >
> > > > The typical command-line I'm executing is the following:
> > > >
> > > > $ time taskset -c 1,3,5,7
>  ./build/release/arrow-flight-benchmark
> > > > -records_per_stream 5000 -num_streams 16 -num_threads
> >> 32
> > > > -records_per_batch 12
> > > >
> > > > Breakdown:
> > > >
> > > > - "time": I want to get CPU user / system / wall-clock
> >> times
> > > >
> > > > - "taskset -c ...": I have a 8-core 16-threads machine and
> >> I
>  want to
> > > >   allow scheduling RPC threads on 4 distinct physical cores
> > > >
> > > > - "-records_per_stream": I want each stream to have enough
>  records so
> > > >   that connection / stream setup costs are negligible
> > > >
> > > > - "-num_strea

Re: Flight / gRPC scalability issue

2019-02-24 Thread Antoine Pitrou


If that was the case, then we would see 100% CPU usage on all CPU cores,
right?  Here my question is why only 2.5 cores are saturated while I'm
pinning the benchmark to 4 physical cores.

Regards

Antoine.


On 24/02/2019 at 14:29, Francois Saint-Jacques wrote:
> A quick glance suggests you're limited by the kernel copying memory around
> (https://gist.github.com/fsaintjacques/1fa00c8e50a09325960d8dc7463c497e).
> I think the next step is to use different physical hosts for client and
> server. This
> way you'll free resources for the server.
> 
> François
> 
> 
> On Thu, Feb 21, 2019 at 12:42 PM Antoine Pitrou  wrote:
> 
>>
>> We're talking about the BCC tools, which are not based on perf:
>> https://github.com/iovisor/bcc/
>>
>> Apparently, using Linux perf for the same purpose is some kind of hassle
>> (you need to write perf scripts?).
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 21/02/2019 at 18:40, Francois Saint-Jacques wrote:
>>> You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to
>> perf,
>>> it'll help the unwinding. Sometimes it's better than the stack pointer
>>> method since it keeps track of inlined functions.
>>>
>>> On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou 
>> wrote:
>>>

 Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
 userspace stack traces properly (it probably needs all dependencies to
 be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
 lot of time is spent waiting for futexes, I don't know if that is for a
 legitimate reason...

 Regards

 Antoine.


 On 21/02/2019 at 17:52, Hatem Helal wrote:
> I was thinking of this variant:
>
> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
>
> but I must admit that I haven't tried that technique myself.
>
>
>
> On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:
>
>
> I don't think that's the answer here.  The question is not how
> to /visualize/ where time is spent waiting, but how to /measure/
>> it.
> Normal profiling only tells you where CPU time is spent, not what
>> the
> process is idly waiting for.
>
> Regards
>
> Antoine.
>
>
> On Thu, 21 Feb 2019 16:29:15 +
> Hatem Helal  wrote:
> > I like flamegraphs for investigating this sort of problem:
> >
> > https://github.com/brendangregg/FlameGraph
> >
> > There are likely many other techniques for inspecting where time
 is being spent but that can at least help narrow down the search space.
> >
> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" <
 fsaintjacq...@gmail.com> wrote:
> >
> > Can you remind us what's the easiest way to get flight
>> working
 with grpc?
> > clone + make install doesn't really work out of the box.
> >
> > François
> >
> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou <
 anto...@python.org> wrote:
> >
> > >
> > > Hello,
> > >
> > > I've been trying to saturate several CPU cores using our
 Flight
> > > benchmark (which spawns a server process and attempts to
 communicate
> > > with it using multiple clients), but haven't managed to.
> > >
> > > The typical command-line I'm executing is the following:
> > >
> > > $ time taskset -c 1,3,5,7
 ./build/release/arrow-flight-benchmark
> > > -records_per_stream 5000 -num_streams 16 -num_threads
>> 32
> > > -records_per_batch 12
> > >
> > > Breakdown:
> > >
> > > - "time": I want to get CPU user / system / wall-clock
>> times
> > >
> > > - "taskset -c ...": I have a 8-core 16-threads machine and
>> I
 want to
> > >   allow scheduling RPC threads on 4 distinct physical cores
> > >
> > > - "-records_per_stream": I want each stream to have enough
 records so
> > >   that connection / stream setup costs are negligible
> > >
> > > - "-num_streams": this is the number of streams the
 benchmark tries to
> > >   download (DoGet()) from the server to the client
> > >
> > > - "-num_threads": this is the number of client threads the
 benchmark
> > >   makes download requests from.  Since our client is
 currently
> > >   blocking, it makes sense to have a large number of client
 threads (to
> > >   allow overlap).  Note that each thread creates a separate
 gRPC client
> > >   and connection.
> > >
> > > - "-records_per_batch": transfer enough records per
 individual RPC
> > >   me

Re: Flight / gRPC scalability issue

2019-02-24 Thread Francois Saint-Jacques
A quick glance suggests you're limited by the kernel copying memory around
(https://gist.github.com/fsaintjacques/1fa00c8e50a09325960d8dc7463c497e).
I think the next step is to use different physical hosts for client and
server. This
way you'll free resources for the server.

François


On Thu, Feb 21, 2019 at 12:42 PM Antoine Pitrou  wrote:

>
> We're talking about the BCC tools, which are not based on perf:
> https://github.com/iovisor/bcc/
>
> Apparently, using Linux perf for the same purpose is some kind of hassle
> (you need to write perf scripts?).
>
> Regards
>
> Antoine.
>
>
> On 21/02/2019 at 18:40, Francois Saint-Jacques wrote:
> > You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to
> perf,
> > it'll help the unwinding. Sometimes it's better than the stack pointer
> > method since it keeps track of inlined functions.
> >
> > On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou 
> wrote:
> >
> >>
> >> Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
> >> userspace stack traces properly (it probably needs all dependencies to
> >> be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
> >> lot of time is spent waiting for futexes, I don't know if that is for a
> >> legitimate reason...
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 21/02/2019 at 17:52, Hatem Helal wrote:
> >>> I was thinking of this variant:
> >>>
> >>> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
> >>>
> >>> but I must admit that I haven't tried that technique myself.
> >>>
> >>>
> >>>
> >>> On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:
> >>>
> >>>
> >>> I don't think that's the answer here.  The question is not how
> >>> to /visualize/ where time is spent waiting, but how to /measure/
> it.
> >>> Normal profiling only tells you where CPU time is spent, not what
> the
> >>> process is idly waiting for.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>> On Thu, 21 Feb 2019 16:29:15 +
> >>> Hatem Helal  wrote:
> >>> > I like flamegraphs for investigating this sort of problem:
> >>> >
> >>> > https://github.com/brendangregg/FlameGraph
> >>> >
> >>> > There are likely many other techniques for inspecting where time
> >> is being spent but that can at least help narrow down the search space.
> >>> >
> >>> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" <
> >> fsaintjacq...@gmail.com> wrote:
> >>> >
> >>> > Can you remind us what's the easiest way to get flight
> working
> >> with grpc?
> >>> > clone + make install doesn't really work out of the box.
> >>> >
> >>> > François
> >>> >
> >>> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou <
> >> anto...@python.org> wrote:
> >>> >
> >>> > >
> >>> > > Hello,
> >>> > >
> >>> > > I've been trying to saturate several CPU cores using our
> >> Flight
> >>> > > benchmark (which spawns a server process and attempts to
> >> communicate
> >>> > > with it using multiple clients), but haven't managed to.
> >>> > >
> >>> > > The typical command-line I'm executing is the following:
> >>> > >
> >>> > > $ time taskset -c 1,3,5,7
> >> ./build/release/arrow-flight-benchmark
> >>> > > -records_per_stream 5000 -num_streams 16 -num_threads
> 32
> >>> > > -records_per_batch 12
> >>> > >
> >>> > > Breakdown:
> >>> > >
> >>> > > - "time": I want to get CPU user / system / wall-clock
> times
> >>> > >
> >>> > > - "taskset -c ...": I have a 8-core 16-threads machine and
> I
> >> want to
> >>> > >   allow scheduling RPC threads on 4 distinct physical cores
> >>> > >
> >>> > > - "-records_per_stream": I want each stream to have enough
> >> records so
> >>> > >   that connection / stream setup costs are negligible
> >>> > >
> >>> > > - "-num_streams": this is the number of streams the
> >> benchmark tries to
> >>> > >   download (DoGet()) from the server to the client
> >>> > >
> >>> > > - "-num_threads": this is the number of client threads the
> >> benchmark
> >>> > >   makes download requests from.  Since our client is
> >> currently
> >>> > >   blocking, it makes sense to have a large number of client
> >> threads (to
> >>> > >   allow overlap).  Note that each thread creates a separate
> >> gRPC client
> >>> > >   and connection.
> >>> > >
> >>> > > - "-records_per_batch": transfer enough records per
> >> individual RPC
> >>> > >   message, to minimize overhead.  This number brings us
> >> close to the
> >>> > >   default gRPC message limit of 4 MB.
> >>> > >
> >>> > > The results I get look like:
> >>> > >
> >>> > > Bytes read: 256
> >>> > > Nanos: 8433804781
> >>> > > Spee

Re: Flight / gRPC scalability issue

2019-02-21 Thread Antoine Pitrou


We're talking about the BCC tools, which are not based on perf:
https://github.com/iovisor/bcc/

Apparently, using Linux perf for the same purpose is some kind of hassle
(you need to write perf scripts?).

Regards

Antoine.


On 21/02/2019 at 18:40, Francois Saint-Jacques wrote:
> You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to perf,
> it'll help the unwinding. Sometimes it's better than the stack pointer
> method since it keeps track of inlined functions.
> 
> On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou  wrote:
> 
>>
>> Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
>> userspace stack traces properly (it probably needs all dependencies to
>> be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
>> lot of time is spent waiting for futexes, I don't know if that is for a
>> legitimate reason...
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 21/02/2019 at 17:52, Hatem Helal wrote:
>>> I was thinking of this variant:
>>>
>>> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
>>>
>>> but I must admit that I haven't tried that technique myself.
>>>
>>>
>>>
>>> On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:
>>>
>>>
>>> I don't think that's the answer here.  The question is not how
>>> to /visualize/ where time is spent waiting, but how to /measure/ it.
>>> Normal profiling only tells you where CPU time is spent, not what the
>>> process is idly waiting for.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On Thu, 21 Feb 2019 16:29:15 +
>>> Hatem Helal  wrote:
>>> > I like flamegraphs for investigating this sort of problem:
>>> >
>>> > https://github.com/brendangregg/FlameGraph
>>> >
>>> > There are likely many other techniques for inspecting where time
>> is being spent but that can at least help narrow down the search space.
>>> >
>>> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" <
>> fsaintjacq...@gmail.com> wrote:
>>> >
>>> > Can you remind us what's the easiest way to get flight working
>> with grpc?
>>> > clone + make install doesn't really work out of the box.
>>> >
>>> > François
>>> >
>>> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou <
>> anto...@python.org> wrote:
>>> >
>>> > >
>>> > > Hello,
>>> > >
>>> > > I've been trying to saturate several CPU cores using our
>> Flight
>>> > > benchmark (which spawns a server process and attempts to
>> communicate
>>> > > with it using multiple clients), but haven't managed to.
>>> > >
>>> > > The typical command-line I'm executing is the following:
>>> > >
>>> > > $ time taskset -c 1,3,5,7
>> ./build/release/arrow-flight-benchmark
>>> > > -records_per_stream 5000 -num_streams 16 -num_threads 32
>>> > > -records_per_batch 12
>>> > >
>>> > > Breakdown:
>>> > >
>>> > > - "time": I want to get CPU user / system / wall-clock times
>>> > >
>>> > > - "taskset -c ...": I have a 8-core 16-threads machine and I
>> want to
>>> > >   allow scheduling RPC threads on 4 distinct physical cores
>>> > >
>>> > > - "-records_per_stream": I want each stream to have enough
>> records so
>>> > >   that connection / stream setup costs are negligible
>>> > >
>>> > > - "-num_streams": this is the number of streams the
>> benchmark tries to
>>> > >   download (DoGet()) from the server to the client
>>> > >
>>> > > - "-num_threads": this is the number of client threads the
>> benchmark
>>> > >   makes download requests from.  Since our client is
>> currently
>>> > >   blocking, it makes sense to have a large number of client
>> threads (to
>>> > >   allow overlap).  Note that each thread creates a separate
>> gRPC client
>>> > >   and connection.
>>> > >
>>> > > - "-records_per_batch": transfer enough records per
>> individual RPC
>>> > >   message, to minimize overhead.  This number brings us
>> close to the
>>> > >   default gRPC message limit of 4 MB.
>>> > >
>>> > > The results I get look like:
>>> > >
>>> > > Bytes read: 256
>>> > > Nanos: 8433804781
>>> > > Speed: 2894.79 MB/s
>>> > >
> >>> > > real 0m8,569s
> >>> > > user 0m6,085s
>>> > > sys 0m15,667s
>>> > >
>>> > >
>>> > > If we divide (user + sys) by real, we conclude that 2.5
>> cores are
>>> > > saturated by this benchmark.  Evidently, this means that the
>> benchmark
>>> > > is waiting a *lot*.  The question is: where?
>>> > >
> >>> > > Here are some things I looked at:
>>> > >
>>> > > - mutex usage inside Arrow.  None seems to pop up (printf is
>> my friend).
>>> > >
>>> > > -

Re: Flight / gRPC scalability issue

2019-02-21 Thread Francois Saint-Jacques
You can compile with DWARF debug info (-g/-ggdb) and pass `--call-graph=dwarf`
to perf; it'll help the unwinding. Sometimes it's better than the stack-pointer
method since it keeps track of inlined functions.

On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou  wrote:

>
> Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
> userspace stack traces properly (it probably needs all dependencies to
> be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
> lot of time is spent waiting for futexes, I don't know if that is for a
> legitimate reason...
>
> Regards
>
> Antoine.
>
>
> On 21/02/2019 at 17:52, Hatem Helal wrote:
> > I was thinking of this variant:
> >
> > http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
> >
> > but I must admit that I haven't tried that technique myself.
> >
> >
> >
> > On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:
> >
> >
> > I don't think that's the answer here.  The question is not how
> > to /visualize/ where time is spent waiting, but how to /measure/ it.
> > Normal profiling only tells you where CPU time is spent, not what the
> > process is idly waiting for.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 21 Feb 2019 16:29:15 +
> > Hatem Helal  wrote:
> > > I like flamegraphs for investigating this sort of problem:
> > >
> > > https://github.com/brendangregg/FlameGraph
> > >
> > > There are likely many other techniques for inspecting where time
> is being spent but that can at least help narrow down the search space.
> > >
> > > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" <
> fsaintjacq...@gmail.com> wrote:
> > >
> > > Can you remind us what's the easiest way to get flight working
> with grpc?
> > > clone + make install doesn't really work out of the box.
> > >
> > > François
> > >
> > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou <
> anto...@python.org> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > I've been trying to saturate several CPU cores using our
> Flight
> > > > benchmark (which spawns a server process and attempts to
> communicate
> > > > with it using multiple clients), but haven't managed to.
> > > >
> > > > The typical command-line I'm executing is the following:
> > > >
> > > > $ time taskset -c 1,3,5,7
> ./build/release/arrow-flight-benchmark
> > > > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > > > -records_per_batch 12
> > > >
> > > > Breakdown:
> > > >
> > > > - "time": I want to get CPU user / system / wall-clock times
> > > >
> > > > - "taskset -c ...": I have a 8-core 16-threads machine and I
> want to
> > > >   allow scheduling RPC threads on 4 distinct physical cores
> > > >
> > > > - "-records_per_stream": I want each stream to have enough
> records so
> > > >   that connection / stream setup costs are negligible
> > > >
> > > > - "-num_streams": this is the number of streams the
> benchmark tries to
> > > >   download (DoGet()) from the server to the client
> > > >
> > > > - "-num_threads": this is the number of client threads the
> benchmark
> > > >   makes download requests from.  Since our client is
> currently
> > > >   blocking, it makes sense to have a large number of client
> threads (to
> > > >   allow overlap).  Note that each thread creates a separate
> gRPC client
> > > >   and connection.
> > > >
> > > > - "-records_per_batch": transfer enough records per
> individual RPC
> > > >   message, to minimize overhead.  This number brings us
> close to the
> > > >   default gRPC message limit of 4 MB.
> > > >
> > > > The results I get look like:
> > > >
> > > > Bytes read: 256
> > > > Nanos: 8433804781
> > > > Speed: 2894.79 MB/s
> > > >
> > > > real 0m8,569s
> > > > user 0m6,085s
> > > > sys 0m15,667s
> > > >
> > > >
> > > > If we divide (user + sys) by real, we conclude that 2.5
> cores are
> > > > saturated by this benchmark.  Evidently, this means that the
> benchmark
> > > > is waiting a *lot*.  The question is: where?
> > > >
> > > > Here are some things I looked at:
> > > >
> > > > - mutex usage inside Arrow.  None seems to pop up (printf is
> my friend).
> > > >
> > > > - number of threads used by the gRPC server.  gRPC
> implicitly spawns a
> > > >   number of threads to handle incoming client requests.
> I've checked
> > > >   (using printf...) that several threads are indeed used to
> serve
> > > >   incoming connections.
> > > >
> > > > - CPU usage b

Re: Flight / gRPC scalability issue

2019-02-21 Thread Antoine Pitrou


Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
userspace stack traces properly (it probably needs all dependencies to
be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
lot of time is spent waiting for futexes, I don't know if that is for a
legitimate reason...

Regards

Antoine.


On 21/02/2019 at 17:52, Hatem Helal wrote:
> I was thinking of this variant:
> 
> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
> 
> but I must admit that I haven't tried that technique myself.
> 
> 
> 
> On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:
> 
> 
> I don't think that's the answer here.  The question is not how
> to /visualize/ where time is spent waiting, but how to /measure/ it.
> Normal profiling only tells you where CPU time is spent, not what the
> process is idly waiting for.
> 
> Regards
> 
> Antoine.
> 
> 
> On Thu, 21 Feb 2019 16:29:15 +
> Hatem Helal  wrote:
> > I like flamegraphs for investigating this sort of problem:
> > 
> > https://github.com/brendangregg/FlameGraph
> > 
> > There are likely many other techniques for inspecting where time is 
> being spent but that can at least help narrow down the search space.
> > 
> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" 
>  wrote:
> > 
> > Can you remind us what's the easiest way to get flight working with 
> grpc?
> > clone + make install doesn't really work out of the box.
> > 
> > François
> > 
> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou 
>  wrote:
> > 
> > >
> > > Hello,
> > >
> > > I've been trying to saturate several CPU cores using our Flight
> > > benchmark (which spawns a server process and attempts to 
> communicate
> > > with it using multiple clients), but haven't managed to.
> > >
> > > The typical command-line I'm executing is the following:
> > >
> > > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > > -records_per_batch 12
> > >
> > > Breakdown:
> > >
> > > - "time": I want to get CPU user / system / wall-clock times
> > >
> > > - "taskset -c ...": I have a 8-core 16-threads machine and I want 
> to
> > >   allow scheduling RPC threads on 4 distinct physical cores
> > >
> > > - "-records_per_stream": I want each stream to have enough 
> records so
> > >   that connection / stream setup costs are negligible
> > >
> > > - "-num_streams": this is the number of streams the benchmark 
> tries to
> > >   download (DoGet()) from the server to the client
> > >
> > > - "-num_threads": this is the number of client threads the 
> benchmark
> > >   makes download requests from.  Since our client is currently
> > >   blocking, it makes sense to have a large number of client 
> threads (to
> > >   allow overlap).  Note that each thread creates a separate gRPC 
> client
> > >   and connection.
> > >
> > > - "-records_per_batch": transfer enough records per individual RPC
> > >   message, to minimize overhead.  This number brings us close to 
> the
> > >   default gRPC message limit of 4 MB.
> > >
> > > The results I get look like:
> > >
> > > Bytes read: 256
> > > Nanos: 8433804781
> > > Speed: 2894.79 MB/s
> > >
> > > real 0m8,569s
> > > user 0m6,085s
> > > sys 0m15,667s
> > >
> > >
> > > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > > saturated by this benchmark.  Evidently, this means that the 
> benchmark
> > > is waiting a *lot*.  The question is: where?
> > >
> > > Here are some things I looked at:
> > >
> > > - mutex usage inside Arrow.  None seems to pop up (printf is my 
> friend).
> > >
> > > - number of threads used by the gRPC server.  gRPC implicitly 
> spawns a
> > >   number of threads to handle incoming client requests.  I've 
> checked
> > >   (using printf...) that several threads are indeed used to serve
> > >   incoming connections.
> > >
> > > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time 
> is
> > >   spent in memcpy() calls in the *client* (precisely, in the
> > >   grpc_byte_buffer_reader_readall() call inside
> > >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't 
> look
> > >   like the server is the bottleneck.
> > >
> > > - the benchmark connects to "localhost".  I've changed it to
> > >   "127.0.0.1", it doesn't m

Re: Flight / gRPC scalability issue

2019-02-21 Thread Hatem Helal
I was thinking of this variant:

http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html

but I must admit that I haven't tried that technique myself.



On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:


I don't think that's the answer here.  The question is not how
to /visualize/ where time is spent waiting, but how to /measure/ it.
Normal profiling only tells you where CPU time is spent, not what the
process is idly waiting for.

Regards

Antoine.


On Thu, 21 Feb 2019 16:29:15 +
Hatem Helal  wrote:
> I like flamegraphs for investigating this sort of problem:
> 
> https://github.com/brendangregg/FlameGraph
> 
> There are likely many other techniques for inspecting where time is being 
spent but that can at least help narrow down the search space.
> 
> On 2/21/19, 4:03 PM, "Francois Saint-Jacques"  
wrote:
> 
> Can you remind us what's the easiest way to get flight working with 
grpc?
> clone + make install doesn't really work out of the box.
> 
> François
> 
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  
wrote:
> 
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records 
so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries 
to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads 
(to
> >   allow overlap).  Note that each thread creates a separate gRPC 
client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real 0m8,569s
> > user 0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the 
benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here are some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my 
friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly 
spawns a
> >   number of threads to handle incoming client requests.  I've 
checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't 
look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python s

Re: Flight / gRPC scalability issue

2019-02-21 Thread Antoine Pitrou


I don't think that's the answer here.  The question is not how
to /visualize/ where time is spent waiting, but how to /measure/ it.
Normal profiling only tells you where CPU time is spent, not what the
process is idly waiting for.

Regards

Antoine.
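
One cheap way to quantify (though not locate) that waiting, as a sketch: wrap
the blocking call and compare wall-clock time against process CPU time. The
workload below is a placeholder, and os.times() is process-wide, so in a
multithreaded client this is only a rough indicator:

    import os
    import time

    def timed(fn, *args):
        # Report wall time vs. CPU time for one call; the gap is time spent
        # off-CPU (waiting), which is what off-CPU profiling would then
        # attribute to specific stacks.
        t0 = time.monotonic()
        c0 = os.times()
        result = fn(*args)
        c1 = os.times()
        wall = time.monotonic() - t0
        cpu = (c1.user - c0.user) + (c1.system - c0.system)
        print("wall %.3fs, cpu %.3fs, waiting ~%.3fs" % (wall, cpu, wall - cpu))
        return result

    timed(time.sleep, 0.5)   # placeholder blocking workload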


On Thu, 21 Feb 2019 16:29:15 +
Hatem Helal  wrote:
> I like flamegraphs for investigating this sort of problem:
> 
> https://github.com/brendangregg/FlameGraph
> 
> There are likely many other techniques for inspecting where time is being 
> spent but that can at least help narrow down the search space.
> 
> On 2/21/19, 4:03 PM, "Francois Saint-Jacques"  
> wrote:
> 
> Can you remind us what's the easiest way to get flight working with grpc?
> clone + make install doesn't really work out of the box.
> 
> François
> 
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  
> wrote:
> 
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads (to
> >   allow overlap).  Note that each thread creates a separate gRPC client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real 0m8,569s
> > user 0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here are some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly spawns a
> >   number of threads to handle incoming client requests.  I've checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python script.  3 GB/s over multiple
> >   connections doesn't look terrific.
> >
> >
> > So it looks like there's a scalability issue inside our current Flight
> > code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> > doesn't look problematic; it should actually be kind of a best case,
> > especially with the above parameters.
> >
> > Does anyone have any clues or ideas?  In particular, is there a simple
> > way to diagnose *where* exactly the waiting times happen?
> >
> > Regards
> >
> > Antoine.
> >  
> 
> 
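
For reference, the 2.5-core figure comes straight from the time(1) output
quoted above; a quick check of the arithmetic using the reported values:

    # (user + sys) / real, using the values from the quoted time(1) output.
    user, sys_, real = 6.085, 15.667, 8.569   # seconds
    print((user + sys_) / real)               # ~2.54 cores busy on average,
                                              # versus the 4 physical cores
                                              # the benchmark was pinned to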





Re: Flight / gRPC scalability issue

2019-02-21 Thread Hatem Helal
I like flamegraphs for investigating this sort of problem:

https://github.com/brendangregg/FlameGraph

There are likely many other techniques for inspecting where time is being spent 
but that can at least help narrow down the search space.

On 2/21/19, 4:03 PM, "Francois Saint-Jacques"  wrote:

Can you remind us what's the easiest way to get flight working with grpc?
clone + make install doesn't really work out of the box.

François

On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  wrote:

>
> Hello,
>
> I've been trying to saturate several CPU cores using our Flight
> benchmark (which spawns a server process and attempts to communicate
> with it using multiple clients), but haven't managed to.
>
> The typical command-line I'm executing is the following:
>
> $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> -records_per_stream 5000 -num_streams 16 -num_threads 32
> -records_per_batch 12
>
> Breakdown:
>
> - "time": I want to get CPU user / system / wall-clock times
>
> - "taskset -c ...": I have a 8-core 16-threads machine and I want to
>   allow scheduling RPC threads on 4 distinct physical cores
>
> - "-records_per_stream": I want each stream to have enough records so
>   that connection / stream setup costs are negligible
>
> - "-num_streams": this is the number of streams the benchmark tries to
>   download (DoGet()) from the server to the client
>
> - "-num_threads": this is the number of client threads the benchmark
>   makes download requests from.  Since our client is currently
>   blocking, it makes sense to have a large number of client threads (to
>   allow overlap).  Note that each thread creates a separate gRPC client
>   and connection.
>
> - "-records_per_batch": transfer enough records per individual RPC
>   message, to minimize overhead.  This number brings us close to the
>   default gRPC message limit of 4 MB.
>
> The results I get look like:
>
> Bytes read: 256
> Nanos: 8433804781
> Speed: 2894.79 MB/s
>
> real 0m8,569s
> user 0m6,085s
> sys 0m15,667s
>
>
> If we divide (user + sys) by real, we conclude that 2.5 cores are
> saturated by this benchmark.  Evidently, this means that the benchmark
> is waiting a *lot*.  The question is: where?
>
> Here are some things I looked at:
>
> - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
>
> - number of threads used by the gRPC server.  gRPC implicitly spawns a
>   number of threads to handle incoming client requests.  I've checked
>   (using printf...) that several threads are indeed used to serve
>   incoming connections.
>
> - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
>   spent in memcpy() calls in the *client* (precisely, in the
>   grpc_byte_buffer_reader_readall() call inside
>   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
>   like the server is the bottleneck.
>
> - the benchmark connects to "localhost".  I've changed it to
>   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
>   connections should be well-optimized on Linux.  It seems highly
>   unlikely that they would incur idle waiting times (rather than CPU
>   time processing packets).
>
> - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
>   (server).  No swapping occurs.
>
> - Disk I/O.  "vmstat" tells me no block I/O happens during the
>   benchmark.
>
> - As a reference, I can transfer 5 GB/s over a single TCP connection
>   using plain sockets in a simple Python script.  3 GB/s over multiple
>   connections doesn't look terrific.
>
>
> So it looks like there's a scalability issue inside our current Flight
> code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> doesn't look problematic; it should actually be kind of a best case,
> especially with the above parameters.
>
> Does anyone have any clues or ideas?  In particular, is there a simple
> way to diagnose *where* exactly the waiting times happen?
>
> Regards
>
> Antoine.
>




Re: Flight / gRPC scalability issue

2019-02-21 Thread Hatem Helal
I like flamegraphs for investigating this sort of problem:

https://github.com/brendangregg/FlameGraph

There are likely many other techniques for inspecting where time is being spent 
but that can at least help narrow down the search space.


On 2/21/19, 4:29 PM, "Wes McKinney"  wrote:

Hi Francois,

It *should* work out of the box. I spent some time to make sure it does.
Can you open a JIRA?

I recommend using the grpc-cpp conda-forge package.

Wes

On Thu, Feb 21, 2019, 11:03 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> Can you remind us what's the easiest way to get flight working with grpc?
> clone + make install doesn't really work out of the box.
>
> François
>
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou 
> wrote:
>
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads (to
> >   allow overlap).  Note that each thread creates a separate gRPC client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real 0m8,569s
> > user 0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here are some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly spawns a
> >   number of threads to handle incoming client requests.  I've checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python script.  3 GB/s over multiple
> >   connections doesn't look terrific.
> >
> >
> > So it looks like there's a scalability issue inside our current Flight
> > code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> > doesn't look problematic; it should actually be kind of a best case,
> > especially with the above parameters.
> >
> > Does anyone have any clues or ideas?  In particular, is there a simple
> > way to diagnose *where* exactly the waiting times happen?
> >
> > Regards
> >
> > Antoine.
> >
>




Re: Flight / gRPC scalability issue

2019-02-21 Thread Wes McKinney
Hi Francois,

It *should* work out of the box. I spent some time to make sure it does.
Can you open a JIRA?

I recommend using the grpc-cpp conda-forge package.

Wes

On Thu, Feb 21, 2019, 11:03 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> Can you remind us what's the easiest way to get flight working with grpc?
> clone + make install doesn't really work out of the box.
>
> François
>
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou 
> wrote:
>
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads (to
> >   allow overlap).  Note that each thread creates a separate gRPC client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real 0m8,569s
> > user 0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here are some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly spawns a
> >   number of threads to handle incoming client requests.  I've checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python script.  3 GB/s over multiple
> >   connections doesn't look terrific.
> >
> >
> > So it looks like there's a scalability issue inside our current Flight
> > code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> > doesn't look problematic; it should actually be kind of a best case,
> > especially with the above parameters.
> >
> > Does anyone have any clues or ideas?  In particular, is there a simple
> > way to diagnose *where* exactly the waiting times happen?
> >
> > Regards
> >
> > Antoine.
> >
>


Re: Flight / gRPC scalability issue

2019-02-21 Thread Antoine Pitrou
On Thu, 21 Feb 2019 11:02:58 -0500
Francois Saint-Jacques  wrote:
> Can you remind us what's the easiest way to get flight working with grpc?
> clone + make install doesn't really work out of the box.

You can install the "grpc-cpp" package from conda-forge.  Our CMake
configuration should pick it up automatically.

Regards

Antoine.


> 
> François
> 
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  wrote:
> 
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads (to
> >   allow overlap).  Note that each thread creates a separate gRPC client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real 0m8,569s
> > user 0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here are some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly spawns a
> >   number of threads to handle incoming client requests.  I've checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python script.  3 GB/s over multiple
> >   connections doesn't look terrific.
> >
> >
> > So it looks like there's a scalability issue inside our current Flight
> > code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> > doesn't look problematic; it should actually be kind of a best case,
> > especially with the above parameters.
> >
> > Does anyone have any clues or ideas?  In particular, is there a simple
> > way to diagnose *where* exactly the waiting times happen?
> >
> > Regards
> >
> > Antoine.
> >  
> 



Re: Flight / gRPC scalability issue

2019-02-21 Thread Francois Saint-Jacques
Can you remind us what's the easiest way to get flight working with grpc?
clone + make install doesn't really work out of the box.

François

On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  wrote:

>
> Hello,
>
> I've been trying to saturate several CPU cores using our Flight
> benchmark (which spawns a server process and attempts to communicate
> with it using multiple clients), but haven't managed to.
>
> The typical command-line I'm executing is the following:
>
> $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> -records_per_stream 5000 -num_streams 16 -num_threads 32
> -records_per_batch 12
>
> Breakdown:
>
> - "time": I want to get CPU user / system / wall-clock times
>
> - "taskset -c ...": I have a 8-core 16-threads machine and I want to
>   allow scheduling RPC threads on 4 distinct physical cores
>
> - "-records_per_stream": I want each stream to have enough records so
>   that connection / stream setup costs are negligible
>
> - "-num_streams": this is the number of streams the benchmark tries to
>   download (DoGet()) from the server to the client
>
> - "-num_threads": this is the number of client threads the benchmark
>   makes download requests from.  Since our client is currently
>   blocking, it makes sense to have a large number of client threads (to
>   allow overlap).  Note that each thread creates a separate gRPC client
>   and connection.
>
> - "-records_per_batch": transfer enough records per individual RPC
>   message, to minimize overhead.  This number brings us close to the
>   default gRPC message limit of 4 MB.
>
> The results I get look like:
>
> Bytes read: 256
> Nanos: 8433804781
> Speed: 2894.79 MB/s
>
> real 0m8,569s
> user 0m6,085s
> sys 0m15,667s
>
>
> If we divide (user + sys) by real, we conclude that 2.5 cores are
> saturated by this benchmark.  Evidently, this means that the benchmark
> is waiting a *lot*.  The question is: where?
>
> Here are some things I looked at:
>
> - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
>
> - number of threads used by the gRPC server.  gRPC implicitly spawns a
>   number of threads to handle incoming client requests.  I've checked
>   (using printf...) that several threads are indeed used to serve
>   incoming connections.
>
> - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
>   spent in memcpy() calls in the *client* (precisely, in the
>   grpc_byte_buffer_reader_readall() call inside
>   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
>   like the server is the bottleneck.
>
> - the benchmark connects to "localhost".  I've changed it to
>   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
>   connections should be well-optimized on Linux.  It seems highly
>   unlikely that they would incur idle waiting times (rather than CPU
>   time processing packets).
>
> - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
>   (server).  No swapping occurs.
>
> - Disk I/O.  "vmstat" tells me no block I/O happens during the
>   benchmark.
>
> - As a reference, I can transfer 5 GB/s over a single TCP connection
>   using plain sockets in a simple Python script.  3 GB/s over multiple
>   connections doesn't look terrific.
>
>
> So it looks like there's a scalability issue inside our current Flight
> code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> doesn't look problematic; it should actually be kind of a best case,
> especially with the above parameters.
>
> Does anyone have any clues or ideas?  In particular, is there a simple
> way to diagnose *where* exactly the waiting times happen?
>
> Regards
>
> Antoine.
>