[jira] [Created] (ARROW-4656) [Rust] Implement CSV Writer

2019-02-21 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-4656:
--

 Summary: [Rust] Implement CSV Writer
 Key: ARROW-4656
 URL: https://issues.apache.org/jira/browse/ARROW-4656
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Paddy Horan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4655) [Packaging] Parallelize binary upload

2019-02-21 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-4655:
---

 Summary: [Packaging] Parallelize binary upload
 Key: ARROW-4655
 URL: https://issues.apache.org/jira/browse/ARROW-4655
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4654) [C++] Implicit Flight target dependencies cause compilation failure

2019-02-21 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4654:
-

 Summary: [C++] Implicit Flight target dependencies cause 
compilation failure
 Key: ARROW-4654
 URL: https://issues.apache.org/jira/browse/ARROW-4654
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Affects Versions: 0.12.0
Reporter: Francois Saint-Jacques
Assignee: Francois Saint-Jacques



{code:sh}
In file included from ../src/arrow/flight/internal.h:23:0,
 from ../src/arrow/python/flight.cc:20:
../src/arrow/flight/protocol-internal.h:22:10: fatal error: 
arrow/flight/Flight.grpc.pb.h: No such file or directory
 #include "arrow/flight/Flight.grpc.pb.h"  // IWYU pragma: export
  ^~
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4653) [C++] decimal multiply broken when both args are negative

2019-02-21 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-4653:
-

 Summary: [C++] decimal multiply broken when both args are negative
 Key: ARROW-4653
 URL: https://issues.apache.org/jira/browse/ARROW-4653
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra
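
The ticket carries no further description. As a purely hypothetical illustration (not taken from the report) of the class of case the title describes, a check along these lines would be expected to hold once the fix lands, assuming arrow::Decimal128's integer constructor, operator*, and operator== for the sketch:

{code:cpp}
// Hypothetical reproduction sketch, not from the ticket: the product of two
// negative decimals must be positive.
#include <cassert>

#include "arrow/util/decimal.h"

int main() {
  arrow::Decimal128 a(-12);
  arrow::Decimal128 b(-3);
  arrow::Decimal128 product = a * b;
  // Expected: 36; the bug described in the title would make this come out wrong.
  assert(product == arrow::Decimal128(36));
  return 0;
}
{code}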






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Flight / gRPC scalability issue

2019-02-21 Thread Antoine Pitrou


We're talking about the BCC tools, which are not based on perf:
https://github.com/iovisor/bcc/

Apparently, using Linux perf for the same purpose is some kind of hassle
(you need to write perf scripts?).

Regards

Antoine.


Le 21/02/2019 à 18:40, Francois Saint-Jacques a écrit :
> You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to perf,
> it'll help the unwinding. Sometimes it's better than the stack pointer
> method since it keep track of inlined functions.
> 
> On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou  wrote:
> 
>>
>> Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
>> userspace stack traces properly (it probably needs all dependencies to
>> be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
>> lot of time is spent waiting for futextes, I don't know if that is for a
>> legitimate reason...
>>
>> Regards
>>
>> Antoine.
>>
>>
>> Le 21/02/2019 à 17:52, Hatem Helal a écrit :
>>> I was thinking of this variant:
>>>
>>> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
>>>
>>> but I must admit that I haven't tried that technique myself.
>>>
>>>
>>>
>>> On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:
>>>
>>>
>>> I don't think that's the answer here.  The question is not how
>>> to /visualize/ where time is spent waiting, but how to /measure/ it.
>>> Normal profiling only tells you where CPU time is spent, not what the
>>> process is idly waiting for.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On Thu, 21 Feb 2019 16:29:15 +
>>> Hatem Helal  wrote:
>>> > I like flamegraphs for investigating this sort of problem:
>>> >
>>> > https://github.com/brendangregg/FlameGraph
>>> >
>>> > There are likely many other techniques for inspecting where time
>> is being spent but that can at least help narrow down the search space.
>>> >
>>> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" <
>> fsaintjacq...@gmail.com> wrote:
>>> >
>>> > Can you remind us what's the easiest way to get flight working
>> with grpc?
>>> > clone + make install doesn't really work out of the box.
>>> >
>>> > François
>>> >
>>> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou <
>> anto...@python.org> wrote:
>>> >
>>> > >
>>> > > Hello,
>>> > >
>>> > > I've been trying to saturate several CPU cores using our
>> Flight
>>> > > benchmark (which spawns a server process and attempts to
>> communicate
>>> > > with it using multiple clients), but haven't managed to.
>>> > >
>>> > > The typical command-line I'm executing is the following:
>>> > >
>>> > > $ time taskset -c 1,3,5,7
>> ./build/release/arrow-flight-benchmark
>>> > > -records_per_stream 5000 -num_streams 16 -num_threads 32
>>> > > -records_per_batch 12
>>> > >
>>> > > Breakdown:
>>> > >
>>> > > - "time": I want to get CPU user / system / wall-clock times
>>> > >
>>> > > - "taskset -c ...": I have a 8-core 16-threads machine and I
>> want to
>>> > >   allow scheduling RPC threads on 4 distinct physical cores
>>> > >
>>> > > - "-records_per_stream": I want each stream to have enough
>> records so
>>> > >   that connection / stream setup costs are negligible
>>> > >
>>> > > - "-num_streams": this is the number of streams the
>> benchmark tries to
>>> > >   download (DoGet()) from the server to the client
>>> > >
>>> > > - "-num_threads": this is the number of client threads the
>> benchmark
>>> > >   makes download requests from.  Since our client is
>> currently
>>> > >   blocking, it makes sense to have a large number of client
>> threads (to
>>> > >   allow overlap).  Note that each thread creates a separate
>> gRPC client
>>> > >   and connection.
>>> > >
>>> > > - "-records_per_batch": transfer enough records per
>> individual RPC
>>> > >   message, to minimize overhead.  This number brings us
>> close to the
>>> > >   default gRPC message limit of 4 MB.
>>> > >
>>> > > The results I get look like:
>>> > >
>>> > > Bytes read: 256
>>> > > Nanos: 8433804781
>>> > > Speed: 2894.79 MB/s
>>> > >
>>> > > real0m8,569s
>>> > > user0m6,085s
>>> > > sys 0m15,667s
>>> > >
>>> > >
>>> > > If we divide (user + sys) by real, we conclude that 2.5
>> cores are
>>> > > saturated by this benchmark.  Evidently, this means that the
>> benchmark
>>> > > is waiting a *lot*.  The question is: where?
>>> > >
>>> > > Here is some things I looked at:
>>> > >
>>> > > - mutex usage inside Arrow.  None seems to pop up (printf is
>> my friend).
>>> > >
>>> > > -

Re: Flight / gRPC scalability issue

2019-02-21 Thread Francois Saint-Jacques
You can compile with DWARF debug info (-g/-ggdb) and pass `--call-graph=dwarf` to
perf; it'll help the unwinding. Sometimes it's better than the frame-pointer
method since it keeps track of inlined functions.

On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou  wrote:

>
> Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
> userspace stack traces properly (it probably needs all dependencies to
> be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
> lot of time is spent waiting for futextes, I don't know if that is for a
> legitimate reason...
>
> Regards
>
> Antoine.
>
>
> Le 21/02/2019 à 17:52, Hatem Helal a écrit :
> > I was thinking of this variant:
> >
> > http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
> >
> > but I must admit that I haven't tried that technique myself.
> >
> >
> >
> > On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:
> >
> >
> > I don't think that's the answer here.  The question is not how
> > to /visualize/ where time is spent waiting, but how to /measure/ it.
> > Normal profiling only tells you where CPU time is spent, not what the
> > process is idly waiting for.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 21 Feb 2019 16:29:15 +
> > Hatem Helal  wrote:
> > > I like flamegraphs for investigating this sort of problem:
> > >
> > > https://github.com/brendangregg/FlameGraph
> > >
> > > There are likely many other techniques for inspecting where time
> is being spent but that can at least help narrow down the search space.
> > >
> > > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" <
> fsaintjacq...@gmail.com> wrote:
> > >
> > > Can you remind us what's the easiest way to get flight working
> with grpc?
> > > clone + make install doesn't really work out of the box.
> > >
> > > François
> > >
> > > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou <
> anto...@python.org> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > I've been trying to saturate several CPU cores using our
> Flight
> > > > benchmark (which spawns a server process and attempts to
> communicate
> > > > with it using multiple clients), but haven't managed to.
> > > >
> > > > The typical command-line I'm executing is the following:
> > > >
> > > > $ time taskset -c 1,3,5,7
> ./build/release/arrow-flight-benchmark
> > > > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > > > -records_per_batch 12
> > > >
> > > > Breakdown:
> > > >
> > > > - "time": I want to get CPU user / system / wall-clock times
> > > >
> > > > - "taskset -c ...": I have a 8-core 16-threads machine and I
> want to
> > > >   allow scheduling RPC threads on 4 distinct physical cores
> > > >
> > > > - "-records_per_stream": I want each stream to have enough
> records so
> > > >   that connection / stream setup costs are negligible
> > > >
> > > > - "-num_streams": this is the number of streams the
> benchmark tries to
> > > >   download (DoGet()) from the server to the client
> > > >
> > > > - "-num_threads": this is the number of client threads the
> benchmark
> > > >   makes download requests from.  Since our client is
> currently
> > > >   blocking, it makes sense to have a large number of client
> threads (to
> > > >   allow overlap).  Note that each thread creates a separate
> gRPC client
> > > >   and connection.
> > > >
> > > > - "-records_per_batch": transfer enough records per
> individual RPC
> > > >   message, to minimize overhead.  This number brings us
> close to the
> > > >   default gRPC message limit of 4 MB.
> > > >
> > > > The results I get look like:
> > > >
> > > > Bytes read: 256
> > > > Nanos: 8433804781
> > > > Speed: 2894.79 MB/s
> > > >
> > > > real0m8,569s
> > > > user0m6,085s
> > > > sys 0m15,667s
> > > >
> > > >
> > > > If we divide (user + sys) by real, we conclude that 2.5
> cores are
> > > > saturated by this benchmark.  Evidently, this means that the
> benchmark
> > > > is waiting a *lot*.  The question is: where?
> > > >
> > > > Here is some things I looked at:
> > > >
> > > > - mutex usage inside Arrow.  None seems to pop up (printf is
> my friend).
> > > >
> > > > - number of threads used by the gRPC server.  gRPC
> implicitly spawns a
> > > >   number of threads to handle incoming client requests.
> I've checked
> > > >   (using printf...) that several threads are indeed used to
> serve
> > > >   incoming connections.
> > > >
> > > > - CPU usage b

Re: Flight / gRPC scalability issue

2019-02-21 Thread Antoine Pitrou


Ah, thanks.  I'm trying it now.  The problem is that it doesn't record
userspace stack traces properly (it probably needs all dependencies to
be recompiled with -fno-omit-frame-pointer :-/).  So while I know that a
lot of time is spent waiting for futexes, I don't know if that is for a
legitimate reason...

Regards

Antoine.


Le 21/02/2019 à 17:52, Hatem Helal a écrit :
> I was thinking of this variant:
> 
> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
> 
> but I must admit that I haven't tried that technique myself.
> 
> 
> 
> On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:
> 
> 
> I don't think that's the answer here.  The question is not how
> to /visualize/ where time is spent waiting, but how to /measure/ it.
> Normal profiling only tells you where CPU time is spent, not what the
> process is idly waiting for.
> 
> Regards
> 
> Antoine.
> 
> 
> On Thu, 21 Feb 2019 16:29:15 +
> Hatem Helal  wrote:
> > I like flamegraphs for investigating this sort of problem:
> > 
> > https://github.com/brendangregg/FlameGraph
> > 
> > There are likely many other techniques for inspecting where time is 
> being spent but that can at least help narrow down the search space.
> > 
> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" 
>  wrote:
> > 
> > Can you remind us what's the easiest way to get flight working with 
> grpc?
> > clone + make install doesn't really work out of the box.
> > 
> > François
> > 
> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou 
>  wrote:
> > 
> > >
> > > Hello,
> > >
> > > I've been trying to saturate several CPU cores using our Flight
> > > benchmark (which spawns a server process and attempts to 
> communicate
> > > with it using multiple clients), but haven't managed to.
> > >
> > > The typical command-line I'm executing is the following:
> > >
> > > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > > -records_per_batch 12
> > >
> > > Breakdown:
> > >
> > > - "time": I want to get CPU user / system / wall-clock times
> > >
> > > - "taskset -c ...": I have a 8-core 16-threads machine and I want 
> to
> > >   allow scheduling RPC threads on 4 distinct physical cores
> > >
> > > - "-records_per_stream": I want each stream to have enough 
> records so
> > >   that connection / stream setup costs are negligible
> > >
> > > - "-num_streams": this is the number of streams the benchmark 
> tries to
> > >   download (DoGet()) from the server to the client
> > >
> > > - "-num_threads": this is the number of client threads the 
> benchmark
> > >   makes download requests from.  Since our client is currently
> > >   blocking, it makes sense to have a large number of client 
> threads (to
> > >   allow overlap).  Note that each thread creates a separate gRPC 
> client
> > >   and connection.
> > >
> > > - "-records_per_batch": transfer enough records per individual RPC
> > >   message, to minimize overhead.  This number brings us close to 
> the
> > >   default gRPC message limit of 4 MB.
> > >
> > > The results I get look like:
> > >
> > > Bytes read: 256
> > > Nanos: 8433804781
> > > Speed: 2894.79 MB/s
> > >
> > > real0m8,569s
> > > user0m6,085s
> > > sys 0m15,667s
> > >
> > >
> > > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > > saturated by this benchmark.  Evidently, this means that the 
> benchmark
> > > is waiting a *lot*.  The question is: where?
> > >
> > > Here is some things I looked at:
> > >
> > > - mutex usage inside Arrow.  None seems to pop up (printf is my 
> friend).
> > >
> > > - number of threads used by the gRPC server.  gRPC implicitly 
> spawns a
> > >   number of threads to handle incoming client requests.  I've 
> checked
> > >   (using printf...) that several threads are indeed used to serve
> > >   incoming connections.
> > >
> > > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time 
> is
> > >   spent in memcpy() calls in the *client* (precisely, in the
> > >   grpc_byte_buffer_reader_readall() call inside
> > >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't 
> look
> > >   like the server is the bottleneck.
> > >
> > > - the benchmark connects to "localhost".  I've changed it to
> > >   "127.0.0.1", it doesn't m

[jira] [Created] (ARROW-4652) [JS] RecordBatchReader throughNode should respect autoDestroy

2019-02-21 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4652:
--

 Summary: [JS] RecordBatchReader throughNode should respect 
autoDestroy
 Key: ARROW-4652
 URL: https://issues.apache.org/jira/browse/ARROW-4652
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


The Reader transform stream closes after reading one set of tables even when 
autoDestroy is false. Instead it should reset/reopen the reader, like 
{{RecordBatchReader.readAll()}} does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Flight / gRPC scalability issue

2019-02-21 Thread Hatem Helal
I was thinking of this variant:

http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html

but I must admit that I haven't tried that technique myself.



On 2/21/19, 4:41 PM, "Antoine Pitrou"  wrote:


I don't think that's the answer here.  The question is not how
to /visualize/ where time is spent waiting, but how to /measure/ it.
Normal profiling only tells you where CPU time is spent, not what the
process is idly waiting for.

Regards

Antoine.


On Thu, 21 Feb 2019 16:29:15 +
Hatem Helal  wrote:
> I like flamegraphs for investigating this sort of problem:
> 
> https://github.com/brendangregg/FlameGraph
> 
> There are likely many other techniques for inspecting where time is being 
spent but that can at least help narrow down the search space.
> 
> On 2/21/19, 4:03 PM, "Francois Saint-Jacques"  
wrote:
> 
> Can you remind us what's the easiest way to get flight working with 
grpc?
> clone + make install doesn't really work out of the box.
> 
> François
> 
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  
wrote:
> 
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records 
so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries 
to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads 
(to
> >   allow overlap).  Note that each thread creates a separate gRPC 
client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real0m8,569s
> > user0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the 
benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here is some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my 
friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly 
spawns a
> >   number of threads to handle incoming client requests.  I've 
checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't 
look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python s

Re: Flight / gRPC scalability issue

2019-02-21 Thread Antoine Pitrou


I don't think that's the answer here.  The question is not how
to /visualize/ where time is spent waiting, but how to /measure/ it.
Normal profiling only tells you where CPU time is spent, not what the
process is idly waiting for.

Regards

Antoine.


On Thu, 21 Feb 2019 16:29:15 +
Hatem Helal  wrote:
> I like flamegraphs for investigating this sort of problem:
> 
> https://github.com/brendangregg/FlameGraph
> 
> There are likely many other techniques for inspecting where time is being 
> spent but that can at least help narrow down the search space.
> 
> On 2/21/19, 4:03 PM, "Francois Saint-Jacques"  
> wrote:
> 
> Can you remind us what's the easiest way to get flight working with grpc?
> clone + make install doesn't really work out of the box.
> 
> François
> 
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  
> wrote:
> 
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads (to
> >   allow overlap).  Note that each thread creates a separate gRPC client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real0m8,569s
> > user0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here is some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly spawns a
> >   number of threads to handle incoming client requests.  I've checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python script.  3 GB/s over multiple
> >   connections doesn't look terrific.
> >
> >
> > So it looks like there's a scalability issue inside our current Flight
> > code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> > doesn't look problematic; it should actually be kind of a best case,
> > especially with the above parameters.
> >
> > Does anyone have any clues or ideas?  In particular, is there a simple
> > way to diagnose *where* exactly the waiting times happen?
> >
> > Regards
> >
> > Antoine.
> >  
> 
> 





Re: Flight / gRPC scalability issue

2019-02-21 Thread Hatem Helal
I like flamegraphs for investigating this sort of problem:

https://github.com/brendangregg/FlameGraph

There are likely many other techniques for inspecting where time is being spent 
but that can at least help narrow down the search space.

On 2/21/19, 4:03 PM, "Francois Saint-Jacques"  wrote:

Can you remind us what's the easiest way to get flight working with grpc?
clone + make install doesn't really work out of the box.

François

On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  wrote:

>
> Hello,
>
> I've been trying to saturate several CPU cores using our Flight
> benchmark (which spawns a server process and attempts to communicate
> with it using multiple clients), but haven't managed to.
>
> The typical command-line I'm executing is the following:
>
> $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> -records_per_stream 5000 -num_streams 16 -num_threads 32
> -records_per_batch 12
>
> Breakdown:
>
> - "time": I want to get CPU user / system / wall-clock times
>
> - "taskset -c ...": I have a 8-core 16-threads machine and I want to
>   allow scheduling RPC threads on 4 distinct physical cores
>
> - "-records_per_stream": I want each stream to have enough records so
>   that connection / stream setup costs are negligible
>
> - "-num_streams": this is the number of streams the benchmark tries to
>   download (DoGet()) from the server to the client
>
> - "-num_threads": this is the number of client threads the benchmark
>   makes download requests from.  Since our client is currently
>   blocking, it makes sense to have a large number of client threads (to
>   allow overlap).  Note that each thread creates a separate gRPC client
>   and connection.
>
> - "-records_per_batch": transfer enough records per individual RPC
>   message, to minimize overhead.  This number brings us close to the
>   default gRPC message limit of 4 MB.
>
> The results I get look like:
>
> Bytes read: 256
> Nanos: 8433804781
> Speed: 2894.79 MB/s
>
> real0m8,569s
> user0m6,085s
> sys 0m15,667s
>
>
> If we divide (user + sys) by real, we conclude that 2.5 cores are
> saturated by this benchmark.  Evidently, this means that the benchmark
> is waiting a *lot*.  The question is: where?
>
> Here is some things I looked at:
>
> - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
>
> - number of threads used by the gRPC server.  gRPC implicitly spawns a
>   number of threads to handle incoming client requests.  I've checked
>   (using printf...) that several threads are indeed used to serve
>   incoming connections.
>
> - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
>   spent in memcpy() calls in the *client* (precisely, in the
>   grpc_byte_buffer_reader_readall() call inside
>   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
>   like the server is the bottleneck.
>
> - the benchmark connects to "localhost".  I've changed it to
>   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
>   connections should be well-optimized on Linux.  It seems highly
>   unlikely that they would incur idle waiting times (rather than CPU
>   time processing packets).
>
> - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
>   (server).  No swapping occurs.
>
> - Disk I/O.  "vmstat" tells me no block I/O happens during the
>   benchmark.
>
> - As a reference, I can transfer 5 GB/s over a single TCP connection
>   using plain sockets in a simple Python script.  3 GB/s over multiple
>   connections doesn't look terrific.
>
>
> So it looks like there's a scalability issue inside our current Flight
> code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> doesn't look problematic; it should actually be kind of a best case,
> especially with the above parameters.
>
> Does anyone have any clues or ideas?  In particular, is there a simple
> way to diagnose *where* exactly the waiting times happen?
>
> Regards
>
> Antoine.
>




Re: Flight / gRPC scalability issue

2019-02-21 Thread Hatem Helal
I like flamegraphs for investigating this sort of problem:

https://github.com/brendangregg/FlameGraph

There are likely many other techniques for inspecting where time is being spent 
but that can at least help narrow down the search space.


On 2/21/19, 4:29 PM, "Wes McKinney"  wrote:

Hi Francois,

It *should* work out of the box. I spent some time to make sure it does.
Can you open a JIRA?

I recommend using the grpc-cpp conda-forge package.

Wes

On Thu, Feb 21, 2019, 11:03 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> Can you remind us what's the easiest way to get flight working with grpc?
> clone + make install doesn't really work out of the box.
>
> François
>
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou 
> wrote:
>
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads (to
> >   allow overlap).  Note that each thread creates a separate gRPC client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real0m8,569s
> > user0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here is some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly spawns a
> >   number of threads to handle incoming client requests.  I've checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python script.  3 GB/s over multiple
> >   connections doesn't look terrific.
> >
> >
> > So it looks like there's a scalability issue inside our current Flight
> > code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> > doesn't look problematic; it should actually be kind of a best case,
> > especially with the above parameters.
> >
> > Does anyone have any clues or ideas?  In particular, is there a simple
> > way to diagnose *where* exactly the waiting times happen?
> >
> > Regards
> >
> > Antoine.
> >
>




Re: Flight / gRPC scalability issue

2019-02-21 Thread Wes McKinney
Hi Francois,

It *should* work out of the box. I spent some time to make sure it does.
Can you open a JIRA?

I recommend using the grpc-cpp conda-forge package.

Wes

On Thu, Feb 21, 2019, 11:03 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> Can you remind us what's the easiest way to get flight working with grpc?
> clone + make install doesn't really work out of the box.
>
> François
>
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou 
> wrote:
>
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads (to
> >   allow overlap).  Note that each thread creates a separate gRPC client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real0m8,569s
> > user0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here is some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly spawns a
> >   number of threads to handle incoming client requests.  I've checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python script.  3 GB/s over multiple
> >   connections doesn't look terrific.
> >
> >
> > So it looks like there's a scalability issue inside our current Flight
> > code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> > doesn't look problematic; it should actually be kind of a best case,
> > especially with the above parameters.
> >
> > Does anyone have any clues or ideas?  In particular, is there a simple
> > way to diagnose *where* exactly the waiting times happen?
> >
> > Regards
> >
> > Antoine.
> >
>


Re: Flight / gRPC scalability issue

2019-02-21 Thread Antoine Pitrou
On Thu, 21 Feb 2019 11:02:58 -0500
Francois Saint-Jacques  wrote:
> Can you remind us what's the easiest way to get flight working with grpc?
> clone + make install doesn't really work out of the box.

You can install the "grpc-cpp" package from conda-forge.  Our CMake
configuration should pick it up automatically.

Regards

Antoine.


> 
> François
> 
> On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  wrote:
> 
> >
> > Hello,
> >
> > I've been trying to saturate several CPU cores using our Flight
> > benchmark (which spawns a server process and attempts to communicate
> > with it using multiple clients), but haven't managed to.
> >
> > The typical command-line I'm executing is the following:
> >
> > $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> > -records_per_stream 5000 -num_streams 16 -num_threads 32
> > -records_per_batch 12
> >
> > Breakdown:
> >
> > - "time": I want to get CPU user / system / wall-clock times
> >
> > - "taskset -c ...": I have a 8-core 16-threads machine and I want to
> >   allow scheduling RPC threads on 4 distinct physical cores
> >
> > - "-records_per_stream": I want each stream to have enough records so
> >   that connection / stream setup costs are negligible
> >
> > - "-num_streams": this is the number of streams the benchmark tries to
> >   download (DoGet()) from the server to the client
> >
> > - "-num_threads": this is the number of client threads the benchmark
> >   makes download requests from.  Since our client is currently
> >   blocking, it makes sense to have a large number of client threads (to
> >   allow overlap).  Note that each thread creates a separate gRPC client
> >   and connection.
> >
> > - "-records_per_batch": transfer enough records per individual RPC
> >   message, to minimize overhead.  This number brings us close to the
> >   default gRPC message limit of 4 MB.
> >
> > The results I get look like:
> >
> > Bytes read: 256
> > Nanos: 8433804781
> > Speed: 2894.79 MB/s
> >
> > real0m8,569s
> > user0m6,085s
> > sys 0m15,667s
> >
> >
> > If we divide (user + sys) by real, we conclude that 2.5 cores are
> > saturated by this benchmark.  Evidently, this means that the benchmark
> > is waiting a *lot*.  The question is: where?
> >
> > Here is some things I looked at:
> >
> > - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
> >
> > - number of threads used by the gRPC server.  gRPC implicitly spawns a
> >   number of threads to handle incoming client requests.  I've checked
> >   (using printf...) that several threads are indeed used to serve
> >   incoming connections.
> >
> > - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
> >   spent in memcpy() calls in the *client* (precisely, in the
> >   grpc_byte_buffer_reader_readall() call inside
> >   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
> >   like the server is the bottleneck.
> >
> > - the benchmark connects to "localhost".  I've changed it to
> >   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
> >   connections should be well-optimized on Linux.  It seems highly
> >   unlikely that they would incur idle waiting times (rather than CPU
> >   time processing packets).
> >
> > - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
> >   (server).  No swapping occurs.
> >
> > - Disk I/O.  "vmstat" tells me no block I/O happens during the
> >   benchmark.
> >
> > - As a reference, I can transfer 5 GB/s over a single TCP connection
> >   using plain sockets in a simple Python script.  3 GB/s over multiple
> >   connections doesn't look terrific.
> >
> >
> > So it looks like there's a scalability issue inside our current Flight
> > code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> > doesn't look problematic; it should actually be kind of a best case,
> > especially with the above parameters.
> >
> > Does anyone have any clues or ideas?  In particular, is there a simple
> > way to diagnose *where* exactly the waiting times happen?
> >
> > Regards
> >
> > Antoine.
> >  
> 



[jira] [Created] (ARROW-4651) [Format] Flight Location should be more flexible than a (host, port) pair

2019-02-21 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-4651:
-

 Summary: [Format] Flight Location should be more flexible than a 
(host, port) pair
 Key: ARROW-4651
 URL: https://issues.apache.org/jira/browse/ARROW-4651
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Format
Affects Versions: 0.12.0
Reporter: Antoine Pitrou


The more future-proof solution is probably to define a URI format. gRPC already 
has something like that, though we might want to define our own format:
https://grpc.io/grpc/cpp/md_doc_naming.html
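
To make that concrete: gRPC's C++ API already accepts target strings in several naming schemes, so a Location could carry such a string (or an Arrow-defined URI) instead of a bare (host, port) pair. A minimal sketch, illustrative only; the port and socket path are made up, and this is not a proposal for the final Arrow format:

{code:cpp}
#include <grpcpp/grpcpp.h>

// Illustrative only: gRPC target strings in the schemes documented at the URL
// above. Whether Flight adopts these schemes or defines its own URI format is
// exactly the open question of this ticket.
void ConnectExamples() {
  auto creds = grpc::InsecureChannelCredentials();
  auto by_dns = grpc::CreateChannel("dns:///flight.example.com:8815", creds);
  auto by_ipv4 = grpc::CreateChannel("ipv4:127.0.0.1:8815", creds);
  auto by_unix = grpc::CreateChannel("unix:/tmp/flight.sock", creds);
}
{code}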




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Flight / gRPC scalability issue

2019-02-21 Thread Francois Saint-Jacques
Can you remind us what's the easiest way to get flight working with grpc?
clone + make install doesn't really work out of the box.

François

On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou  wrote:

>
> Hello,
>
> I've been trying to saturate several CPU cores using our Flight
> benchmark (which spawns a server process and attempts to communicate
> with it using multiple clients), but haven't managed to.
>
> The typical command-line I'm executing is the following:
>
> $ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
> -records_per_stream 5000 -num_streams 16 -num_threads 32
> -records_per_batch 12
>
> Breakdown:
>
> - "time": I want to get CPU user / system / wall-clock times
>
> - "taskset -c ...": I have a 8-core 16-threads machine and I want to
>   allow scheduling RPC threads on 4 distinct physical cores
>
> - "-records_per_stream": I want each stream to have enough records so
>   that connection / stream setup costs are negligible
>
> - "-num_streams": this is the number of streams the benchmark tries to
>   download (DoGet()) from the server to the client
>
> - "-num_threads": this is the number of client threads the benchmark
>   makes download requests from.  Since our client is currently
>   blocking, it makes sense to have a large number of client threads (to
>   allow overlap).  Note that each thread creates a separate gRPC client
>   and connection.
>
> - "-records_per_batch": transfer enough records per individual RPC
>   message, to minimize overhead.  This number brings us close to the
>   default gRPC message limit of 4 MB.
>
> The results I get look like:
>
> Bytes read: 256
> Nanos: 8433804781
> Speed: 2894.79 MB/s
>
> real0m8,569s
> user0m6,085s
> sys 0m15,667s
>
>
> If we divide (user + sys) by real, we conclude that 2.5 cores are
> saturated by this benchmark.  Evidently, this means that the benchmark
> is waiting a *lot*.  The question is: where?
>
> Here is some things I looked at:
>
> - mutex usage inside Arrow.  None seems to pop up (printf is my friend).
>
> - number of threads used by the gRPC server.  gRPC implicitly spawns a
>   number of threads to handle incoming client requests.  I've checked
>   (using printf...) that several threads are indeed used to serve
>   incoming connections.
>
> - CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
>   spent in memcpy() calls in the *client* (precisely, in the
>   grpc_byte_buffer_reader_readall() call inside
>   arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
>   like the server is the bottleneck.
>
> - the benchmark connects to "localhost".  I've changed it to
>   "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
>   connections should be well-optimized on Linux.  It seems highly
>   unlikely that they would incur idle waiting times (rather than CPU
>   time processing packets).
>
> - RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
>   (server).  No swapping occurs.
>
> - Disk I/O.  "vmstat" tells me no block I/O happens during the
>   benchmark.
>
> - As a reference, I can transfer 5 GB/s over a single TCP connection
>   using plain sockets in a simple Python script.  3 GB/s over multiple
>   connections doesn't look terrific.
>
>
> So it looks like there's a scalability issue inside our current Flight
> code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
> doesn't look problematic; it should actually be kind of a best case,
> especially with the above parameters.
>
> Does anyone have any clues or ideas?  In particular, is there a simple
> way to diagnose *where* exactly the waiting times happen?
>
> Regards
>
> Antoine.
>


[jira] [Created] (ARROW-4650) The patch for PARQUET-1508 leads to infinite loop and infinite memory allocation when reading very sparse ByteArray columns

2019-02-21 Thread Valery Meleshkin (JIRA)
Valery Meleshkin created ARROW-4650:
---

 Summary: The patch for PARQUET-1508 leads to infinite loop and 
infinite memory allocation when reading very sparse ByteArray columns
 Key: ARROW-4650
 URL: https://issues.apache.org/jira/browse/ARROW-4650
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Valery Meleshkin


In this loop

[https://github.com/apache/arrow/commit/3d435e4f8d5fb7a54a4a9d285e1a42d60186d8dc#diff-47fe879cb9baad6c633c55f0a34a09c3R739]

the branch of the if statement that handles null values does not increment the 
variable 'i'. Therefore, on chunks containing only NULLs, once a thread enters 
the loop it stays there forever. I'm not entirely sure whether the 'num_values' 
variable was meant to be the number of non-NULL values, yet the total number of 
values is passed here:

[https://github.com/apache/arrow/blob/3d435e4f8d5fb7a54a4a9d285e1a42d60186d8dc/cpp/src/parquet/arrow/record_reader.cc#L528]

On my local machine, adding `++i` to the NULL-handling branch seems to fix the 
problem. Unfortunately, I'm not familiar enough with the codebase to be certain 
it's a proper fix.
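
For readers without the diff at hand, the pattern being described can be sketched like this (a simplified stand-in, not the actual Arrow record-reader code):

{code:cpp}
#include <cstdint>

// Simplified stand-in for the loop described above, not the actual Arrow code.
// 'valid_bits' marks non-NULL slots; 'num_values' counts all slots in the chunk.
void DecodeWithNulls(const bool* valid_bits, int64_t num_values) {
  int64_t i = 0;
  while (i < num_values) {
    if (!valid_bits[i]) {
      // NULL slot: nothing to decode. Without this increment the loop never
      // advances on all-NULL chunks -- the reported infinite loop.
      ++i;  // the reporter's proposed fix
      continue;
    }
    // ... decode one non-NULL value ...
    ++i;
  }
}
{code}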



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Flight / gRPC scalability issue

2019-02-21 Thread Antoine Pitrou


Hello,

I've been trying to saturate several CPU cores using our Flight
benchmark (which spawns a server process and attempts to communicate
with it using multiple clients), but haven't managed to.

The typical command-line I'm executing is the following:

$ time taskset -c 1,3,5,7  ./build/release/arrow-flight-benchmark
-records_per_stream 5000 -num_streams 16 -num_threads 32
-records_per_batch 12

Breakdown:

- "time": I want to get CPU user / system / wall-clock times

- "taskset -c ...": I have a 8-core 16-threads machine and I want to
  allow scheduling RPC threads on 4 distinct physical cores

- "-records_per_stream": I want each stream to have enough records so
  that connection / stream setup costs are negligible

- "-num_streams": this is the number of streams the benchmark tries to
  download (DoGet()) from the server to the client

- "-num_threads": this is the number of client threads the benchmark
  makes download requests from.  Since our client is currently
  blocking, it makes sense to have a large number of client threads (to
  allow overlap).  Note that each thread creates a separate gRPC client
  and connection.

- "-records_per_batch": transfer enough records per individual RPC
  message, to minimize overhead.  This number brings us close to the
  default gRPC message limit of 4 MB.

The results I get look like:

Bytes read: 256
Nanos: 8433804781
Speed: 2894.79 MB/s

real    0m8,569s
user    0m6,085s
sys     0m15,667s


If we divide (user + sys) by real, we conclude that 2.5 cores are
saturated by this benchmark.  Evidently, this means that the benchmark
is waiting a *lot*.  The question is: where?

Here are some things I looked at:

- mutex usage inside Arrow.  None seems to pop up (printf is my friend).

- number of threads used by the gRPC server.  gRPC implicitly spawns a
  number of threads to handle incoming client requests.  I've checked
  (using printf...) that several threads are indeed used to serve
  incoming connections.

- CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
  spent in memcpy() calls in the *client* (precisely, in the
  grpc_byte_buffer_reader_readall() call inside
  arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
  like the server is the bottleneck.

- the benchmark connects to "localhost".  I've changed it to
  "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
  connections should be well-optimized on Linux.  It seems highly
  unlikely that they would incur idle waiting times (rather than CPU
  time processing packets).

- RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
  (server).  No swapping occurs.

- Disk I/O.  "vmstat" tells me no block I/O happens during the
  benchmark.

- As a reference, I can transfer 5 GB/s over a single TCP connection
  using plain sockets in a simple Python script.  3 GB/s over multiple
  connections doesn't look terrific.


So it looks like there's a scalability issue inside our current Flight
code, or perhaps inside gRPC.  The benchmark itself, if simplistic,
doesn't look problematic; it should actually be kind of a best case,
especially with the above parameters.

Does anyone have any clues or ideas?  In particular, is there a simple
way to diagnose *where* exactly the waiting times happen?

Regards

Antoine.
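
For context on the "default gRPC message limit of 4 MB" mentioned above: gRPC's
C++ API lets both the client and the server raise or lift that per-message limit.
A minimal sketch of general gRPC usage, not the benchmark's actual code (the
target address is made up; -1 means "unlimited"):

  #include <memory>
  #include <grpcpp/grpcpp.h>

  // Client side: raise the receive limit on the channel.
  std::shared_ptr<grpc::Channel> MakeClientChannel() {
    grpc::ChannelArguments args;
    args.SetMaxReceiveMessageSize(-1);  // -1 = unlimited
    return grpc::CreateCustomChannel(
        "localhost:8815",  // illustrative target
        grpc::InsecureChannelCredentials(), args);
  }

  // Server side: raise the limits on the builder before BuildAndStart().
  void ConfigureServer(grpc::ServerBuilder* builder) {
    builder->SetMaxReceiveMessageSize(-1);
    builder->SetMaxSendMessageSize(-1);
  }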


[jira] [Created] (ARROW-4649) [C++/CI/R] Add nightly job that builds `brew install apache-arrow --HEAD`

2019-02-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4649:
--

 Summary: [C++/CI/R] Add nightly job that builds `brew install 
apache-arrow --HEAD`
 Key: ARROW-4649
 URL: https://issues.apache.org/jira/browse/ARROW-4649
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration, R
Reporter: Uwe L. Korn
 Fix For: 0.13.0


Now that we have an Arrow Homebrew formula again, and since we may want to 
recommend it as a simple setup for R Arrow users, we should add a nightly 
crossbow task that checks whether it still builds fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4648) [C++/Question] Naming/organizational inconsistencies in cpp codebase

2019-02-21 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4648:
--

 Summary: [C++/Question] Naming/organizational inconsistencies in 
cpp codebase
 Key: ARROW-4648
 URL: https://issues.apache.org/jira/browse/ARROW-4648
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Krisztian Szucs


Even though my eyes are used to the codebase by now, I still find the naming 
and/or code organization inconsistent.

h2. File Formats

Arrow already supports a couple of file formats, namely Parquet, Feather, JSON, 
CSV, and ORC, but their placement in the codebase is quite odd:
- parquet: src/parquet
- feather: src/arrow/ipc/feather
- orc: src/arrow/adapters/orc
- csv: src/arrow/csv
- json: src/arrow/json

I might misunderstand the purpose of these sources, but I'd expect them to be 
organized under the same roof.

h2. Inter-Process-Communication vs. Flight

From the "ipc" name I would expect it to cover Flight's functionality as well.

Flight's placement is a bit odd too: since it has its own codename, it should be 
placed directly under cpp/src, like parquet, plasma, or gandiva.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4647) [Packaging] dev/release/00-prepare.sh fails for minor version changes

2019-02-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4647:
--

 Summary: [Packaging] dev/release/00-prepare.sh fails for minor 
version changes
 Key: ARROW-4647
 URL: https://issues.apache.org/jira/browse/ARROW-4647
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Affects Versions: 0.12.0
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0


When the next version only bumps the patch level, we don't need to move the 
Debian libraries to a different suffix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4645) [C++/Packaging] Ship Gandiva with OSX and Windows wheels

2019-02-21 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4645:
--

 Summary: [C++/Packaging] Ship Gandiva with OSX and Windows wheels
 Key: ARROW-4645
 URL: https://issues.apache.org/jira/browse/ARROW-4645
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva, Packaging
Reporter: Krisztian Szucs


Gandiva is currently only shipped with the Linux wheels; we should support it on 
all platforms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4644) [C++/Docker] Build Gandiva in the docker containers

2019-02-21 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4644:
--

 Summary: [C++/Docker] Build Gandiva in the docker containers
 Key: ARROW-4644
 URL: https://issues.apache.org/jira/browse/ARROW-4644
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Krisztian Szucs


Install the LLVM dependency and enable Gandiva:

https://github.com/apache/arrow/pull/3484/files#diff-1f2ebc25efb8f1e6646cbd31ce2f34f4R51



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4646) [C++/Packaging] Ship gandiva with the conda-forge packages

2019-02-21 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4646:
--

 Summary: [C++/Packaging] Ship gandiva with the conda-forge packages
 Key: ARROW-4646
 URL: https://issues.apache.org/jira/browse/ARROW-4646
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging
Reporter: Krisztian Szucs


Gandiva is not yet built with the conda packages:
https://github.com/conda-forge/arrow-cpp-feedstock/blob/master/recipe/build.sh



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)