Re: [JAVA] Arrow performance measurement

2018-12-01 Thread Animesh Trivedi
Hi Wes, sure. I opened a ticket and will do a pull request.

Cheers,
--
Animesh

On Fri, Nov 30, 2018 at 5:28 PM Wes McKinney  wrote:

> hi Animesh -- can you link to JIRA issues about the C++ improvements
> you're describing? Want to make sure this doesn't fall through the
> cracks
>
> Thanks
> Wes
> On Mon, Nov 26, 2018 at 7:54 AM Antoine Pitrou  wrote:
> >
> >
> > Hi Animesh,
> >
> > Le 26/11/2018 à 14:23, Animesh Trivedi a écrit :
> > >
> > > * C++ bitmap code can be optimized further by using unsigned integers
> > > rather than "int64_t" for bitmap checks, and eliminating the kBitmap.
> > > See here https://godbolt.org/z/deq0_q - compare the size of the
> > > assembly code. And the performance measurements in the blog show up to
> > > 50% performance gains. Alternatively, if the signed-to-unsigned upgrade
> > > is not possible (perhaps in every language), then in the C++ code we
> > > should use the bit operations directly (`>>3` for division by 8, and
> > > `& 0x7` for modulo by 8), instead of `/` and `%`.
> >
> > Thank you for noticing this.  Switching to unsigned (and doing a
> > static_cast to unsigned) sounds good to me.
> >
> > Regards
> >
> > Antoine.
>
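The shift-and-mask trick discussed above can be sketched in plain Java (an illustrative sketch, not Arrow code; the class and method names here are hypothetical): for non-negative indices, `i >> 3` equals `i / 8` and `i & 0x7` equals `i % 8`, which is how a bit is located in a validity bitmap without division.

```java
public class BitmapOps {
    // Locate validity bit `index` in a little-endian bitmap using only
    // shift and mask. Both identities below hold for index >= 0.
    static boolean isSet(byte[] bitmap, int index) {
        int byteIndex = index >> 3;  // same as index / 8 for index >= 0
        int bitOffset = index & 0x7; // same as index % 8 for index >= 0
        return (bitmap[byteIndex] & (1 << bitOffset)) != 0;
    }

    public static void main(String[] args) {
        byte[] bitmap = new byte[]{(byte) 0b00000101}; // bits 0 and 2 set
        System.out.println(isSet(bitmap, 0)); // true
        System.out.println(isSet(bitmap, 1)); // false
        System.out.println(isSet(bitmap, 2)); // true
    }
}
```

The compiler can only make this substitution automatically when it knows the operand is non-negative, which is why unsigned (or explicitly shifted) code generates less assembly in the godbolt comparison.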


Re: [JAVA] Arrow performance measurement

2018-11-30 Thread Wes McKinney
hi Animesh -- can you link to JIRA issues about the C++ improvements
you're describing? Want to make sure this doesn't fall through the
cracks

Thanks
Wes
On Mon, Nov 26, 2018 at 7:54 AM Antoine Pitrou  wrote:
>
>
> Hi Animesh,
>
> Le 26/11/2018 à 14:23, Animesh Trivedi a écrit :
> >
> > * C++ bitmap code can be optimized further by using unsigned integers
> > rather than "int64_t" for bitmap checks, and eliminating the kBitmap.
> > See here https://godbolt.org/z/deq0_q - compare the size of the assembly
> > code. And the performance measurements in the blog show up to 50%
> > performance gains. Alternatively, if the signed-to-unsigned upgrade is
> > not possible (perhaps in every language), then in the C++ code we should
> > use the bit operations directly (`>>3` for division by 8, and `& 0x7`
> > for modulo by 8), instead of `/` and `%`.
>
> Thank you for noticing this.  Switching to unsigned (and doing a
> static_cast to unsigned) sounds good to me.
>
> Regards
>
> Antoine.


Re: [JAVA] Arrow performance measurement

2018-11-26 Thread Antoine Pitrou


Hi Animesh,

Le 26/11/2018 à 14:23, Animesh Trivedi a écrit :
> 
> * C++ bitmap code can be optimized further by using unsigned integers
> rather than "int64_t" for bitmap checks, and eliminating the kBitmap.
> See here https://godbolt.org/z/deq0_q - compare the size of the assembly
> code. And the performance measurements in the blog show up to 50%
> performance gains. Alternatively, if the signed-to-unsigned upgrade is
> not possible (perhaps in every language), then in the C++ code we should
> use the bit operations directly (`>>3` for division by 8, and `& 0x7`
> for modulo by 8), instead of `/` and `%`.

Thank you for noticing this.  Switching to unsigned (and doing a
static_cast to unsigned) sounds good to me.

Regards

Antoine.


Re: [JAVA] Arrow performance measurement

2018-11-26 Thread Animesh Trivedi
t; > Parquet,
>> >> > > > > ORC,
>> >> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table).
>> And
>> >> then
>> >> > > > > > > curiously
>> >> > > > > > > > benchmarked Arrow as well because its design choices are
>> a
>> >> better
>> >> > > > > fit for
>> >> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am
>> >> > > > > investigating.
>> >> > > > > > > From
>> >> > > > > > > > this point of view, I am trying to find out whether Arrow
>> >> > > > > > > > can be the file format
>> >> > > > > > > > for the next generation of storage/networking devices
>> (see
>> >> Apache
>> >> > > > > Crail
>> >> > > > > > > > project) delivering close to the hardware speed
>> >> reading/writing
>> >> > > > > rates. As
>> >> > > > > > > > Wes pointed out that a C++ library implementation should
>> be
>> >> > > memory-IO
>> >> > > > > > > > bound, so what would it take to deliver the same
>> >> performance in
>> >> > > Java
>> >> > > > > ;)
>> >> > > > > > > > (and then, from across the network).
>> >> > > > > > > >
>> >> > > > > > > > I hope this makes sense.
>> >> > > > > > > >
>> >> > > > > > > > Cheers,
>> >> > > > > > > > --
>> >> > > > > > > > Animesh
>> >> > > > > > > >
>> >> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
>> >> > > jacq...@apache.org>
>> >> > > > > > > wrote:
>> >> > > > > > > >
>> >> > > > > > > > > My big question is what is the use case and how/what
>> are
>> >> you
>> >> > > > > trying to
>> >> > > > > > > > > compare? Arrow's implementation is more focused on
>> >> interacting
>> >> > > > > with the
>> >> > > > > > > > > structure than transporting it. Generally speaking,
>> when
>> >> we're
>> >> > > > > working
>> >> > > > > > > with
>> >> > > > > > > > > Arrow data we frequently are just interacting with
>> memory
>> >> > > > > locations and
>> >> > > > > > > > > doing direct operations. If you have a layer that
>> >> supports that
>> >> > > > > type of
>> >> > > > > > > > > semantic, create a movement technique that depends on
>> >> that.
>> >> > > Arrow
>> >> > > > > > > doesn't
>> >> > > > > > > > > force a particular API since the data itself is defined
>> >> by its
>> >> > > > > > > in-memory
>> >> > > > > > > > > layout so if you have a custom use or pattern, just
>> work
>> >> with
>> >> > > the
>> >> > > > > > > in-memory
>> >> > > > > > > > > structures.
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
>> >> > > wesmck...@gmail.com>
>> >> > > > > > > wrote:
>> >> > > > > > > > >
>> >> > > > > > > > > > hi Animesh,
>> >> > > > > > > > > >
>> >> > > > > > > > > > Per Johan's comments, the C++ library is essentially
>> >> going
>> >> > > to be
>> >> > > > > > > > > > IO/memory bandwidth bound since you're interacting
>> with
>> >&

Re: [JAVA] Arrow performance measurement

2018-10-11 Thread Li Jin
; > > >
>> > > > > > > > Cheers,
>> > > > > > > > --
>> > > > > > > > Animesh
>> > > > > > > >
>> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
>> > > jacq...@apache.org>
>> > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > My big question is what is the use case and how/what are
>> you
>> > > > > trying to
>> > > > > > > > > compare? Arrow's implementation is more focused on
>> interacting
>> > > > > with the
>> > > > > > > > > structure than transporting it. Generally speaking, when
>> we're
>> > > > > working
>> > > > > > > with
>> > > > > > > > > Arrow data we frequently are just interacting with memory
>> > > > > locations and
>> > > > > > > > > doing direct operations. If you have a layer that
>> supports that
>> > > > > type of
>> > > > > > > > > semantic, create a movement technique that depends on
>> that.
>> > > Arrow
>> > > > > > > doesn't
>> > > > > > > > > force a particular API since the data itself is defined
>> by its
>> > > > > > > in-memory
>> > > > > > > > > layout so if you have a custom use or pattern, just work
>> with
>> > > the
>> > > > > > > in-memory
>> > > > > > > > > structures.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
>> > > wesmck...@gmail.com>
>> > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > hi Animesh,
>> > > > > > > > > >
>> > > > > > > > > > Per Johan's comments, the C++ library is essentially
>> going
>> > > to be
>> > > > > > > > > > IO/memory bandwidth bound since you're interacting with
>> raw
>> > > > > pointers.
>> > > > > > > > > >
>> > > > > > > > > > I'm looking at your code
>> > > > > > > > > >
>> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
>> > > > > > > > > >   Float4Vector accessor = (Float4Vector) fv;
>> > > > > > > > > >   int valCount = accessor.getValueCount();
>> > > > > > > > > >   for (int i = 0; i < valCount; i++) {
>> > > > > > > > > >     if (!accessor.isNull(i)) {
>> > > > > > > > > >       float4Count += 1;
>> > > > > > > > > >       checksum += accessor.get(i);
>> > > > > > > > > >     }
>> > > > > > > > > >   }
>> > > > > > > > > > }
>> > > > > > > > > >
>> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
>> advise
>> > > you
>> > > > > the
>> > > > > > > > > > fastest way to iterate over this data -- my
>> understanding is
>> > > that
>> > > > > > > much
>> > > > > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf
>> > > objects
>> > > > > > > > > > rather than going through the higher level APIs. You're
>> also
>> > > > > dropping
>> > > > > > > > > > performance because memory mapping is not yet
>> implemented in
>> > > > > Java,
>> > > > > > > see
>> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
>> > > > > > > > > >
>> > > > > > > > > > Furthermore, the IPC reader class you are using could
>> be made
>> > > > > more
>> > 

Re: [JAVA] Arrow performance measurement

2018-10-11 Thread Li Jin
> > > > > > IO/memory bandwidth bound since you're interacting with
> raw
> > > > > pointers.
> > > > > > > > > >
> > > > > > > > > > I'm looking at your code
> > > > > > > > > >
> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > > > > >   Float4Vector accessor = (Float4Vector) fv;
> > > > > > > > > >   int valCount = accessor.getValueCount();
> > > > > > > > > >   for (int i = 0; i < valCount; i++) {
> > > > > > > > > >     if (!accessor.isNull(i)) {
> > > > > > > > > >       float4Count += 1;
> > > > > > > > > >       checksum += accessor.get(i);
> > > > > > > > > >     }
> > > > > > > > > >   }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to
> advise
> > > you
> > > > > the
> > > > > > > > > > fastest way to iterate over this data -- my
> understanding is
> > > that
> > > > > > > much
> > > > > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf
> > > objects
> > > > > > > > > > rather than going through the higher level APIs. You're
> also
> > > > > dropping
> > > > > > > > > > performance because memory mapping is not yet
> implemented in
> > > > > Java,
> > > > > > > see
> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > > > > >
> > > > > > > > > > Furthermore, the IPC reader class you are using could be
> made
> > > > > more
> > > > > > > > > > efficient. I described the problem in
> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this
> > > will be
> > > > > > > > > > required as soon as we have the ability to do memory
> mapping
> > > in
> > > > > Java
> > > > > > > > > >
> > > > > > > > > > Could Crail use the Arrow data structures in its runtime
> > > rather
> > > > > than
> > > > > > > > > > copying? If not, how are Crail's runtime data structures
> > > > > different?
> > > > > > > > > >
> > > > > > > > > > - Wes
> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hello Animesh,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing. We
> > > have
> > > > > > > performed
> > > > > > > > > > some similar measurements to your third case in the past
> for
> > > > > C/C++ on
> > > > > > > > > > collections of various basic types such as primitives and
> > > > > strings.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I can say that in terms of consuming data from the
> Arrow
> > > format
> > > > > > > versus
> > > > > > > > > > language native collections in C++, I've so far always
> been
> > > able
> > > > > to
> > > > > > > get
> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to the
> > > Arrow
> > > > > > > format
> > > > > > > > > > itself). Especially when accessing the data through
> Arrow's
> > > raw
> > > > > data
> > > > > > > > > > pointers (and using for example std::string_view-like
> > > > > constructs).
> > > > > > > > > > >
> > > > > > > > > > > In C/C++ the fast data structures are engineered so
> > > > > > > > > > > that as few pointer traversals as possible are required
> > > > > > > > > > > and the memory footprint is as small as possible. Thus
> > > > > > > > > > > each memory access is relatively efficient (in terms of
> > > > > > > > > > > obtaining the data of interest). The same can absolutely
> > > > > > > > > > > be said for Arrow, which is, if anything, even more
> > > > > > > > > > > efficient in some cases where object fields are of
> > > > > > > > > > > variable length.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap.
> This
> > > > > requires
> > > > > > > the
> > > > > > > > > > JVM to interface to it through some calls to Unsafe
> hidden
> > > under
> > > > > the
> > > > > > > > > Netty
> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an
> expert
> > > on
> > > > > > > this).
> > > > > > > > > > Those calls are the only reason I can think of that would
> > > > > degrade the
> > > > > > performance a bit compared to a pure Java case. I don't
> know
> > > if
> > > > > the
> > > > > > > > > Unsafe
> > > > > > > > > > calls are inlined during JIT compilation. If they aren't
> they
> > > > > will
> > > > > > > > > increase
> > > > > > > > > > access latency to any data a little bit.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I don't have a similar machine so it's not easy to
> relate
> > > my
> > > > > > > numbers to
> > > > > > > > > > yours, but if you can get that data consumed with 100
> Gbps
> > > in a
> > > > > pure
> > > > > > > Java
> > > > > > > > > > case, I don't see any reason (resulting from Arrow
> format /
> > > > > off-heap
> > > > > > > > > > storage) why you wouldn't be able to get at least really
> > > close.
> > > > > Can
> > > > > > > you
> > > > > > > > > get
> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java with
> your
> > > > > > > consumption
> > > > > > > > > > functions in the first place?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'm interested to see your progress on this.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Kind regards,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Johan Peltenburg
> > > > > > > > > > >
> > > > > > > > > > > 
> > > > > > > > > > > From: Animesh Trivedi 
> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > > > > > > To: dev@arrow.apache.org; d...@crail.apache.org
> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > A week ago, Wes and I had a discussion about the
> > > performance
> > > > > of the
> > > > > > > > > > > Arrow/Java implementation on the Apache Crail
> (Incubating)
> > > > > mailing
> > > > > > > > > list (
> > > > > > > > > > >
> > > > > > >
> > > http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser
> > > > > > > > > ).
> > > > > > > > > > In
> > > > > > > > > > > a nutshell: I am investigating the performance of
> various
> > > file
> > > > > > > formats
> > > > > > > > > > > (including Arrow) on high-performance NVMe and
> > > > > RDMA/100Gbps/RoCE
> > > > > > > > > setups.
> > > > > > > > > > I
> > > > > > > > > > > benchmarked how long it takes to materialize values
> > > (ints,
> > > > > > > longs,
> > > > > > > > > > > doubles) of the store_sales table, the largest table
> in the
> > > > > TPC-DS
> > > > > > > > > > dataset
> > > > > > > > > > > stored on different file formats. Here is a write-up on
> > > this -
> > > > > > > > > > >
> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I
> > > > > > > found
> > > > > > > > > > that
> > > > > > > > > > > between a pair of machines connected over a 100 Gbps
> link,
> > > Arrow
> > > > > > > > > > > (used
> > > > > > > > > > as a
> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of
> > > bandwidth
> > > > > with
> > > > > > > all
> > > > > > > > > 16
> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is
> in-memory
> > > IPC
> > > > > > > format,
> > > > > > > > > > and
> > > > > > > > > > > has not been optimized for storage interfaces/APIs like
> > > HDFS;
> > > > > (ii)
> > > > > > > the
> > > > > > > > > > > performance I am measuring is for the java
> implementation.
> > > > > > > > > > >
> > > > > > > > > > > Wes, I hope I summarized our discussion correctly.
> > > > > > > > > > >
> > > > > > > > > > > That brings us to this email where I promised to
> follow up
> > > on
> > > > > the
> > > > > > > Arrow
> > > > > > > > > > > mailing list to understand and optimize the
> performance of
> > > > > > > Arrow/Java
> > > > > > > > > > > implementation on high-performance devices. I wrote a
> small
> > > > > > > stand-alone
> > > > > > > > > > > benchmark (
> > > > > https://github.com/animeshtrivedi/benchmarking-arrow)
> > > > > > > with
> > > > > > > > > > three
> > > > > > > > > > > implementations of WritableByteChannel,
> SeekableByteChannel
> > > > > > > interfaces:
> > > > > > > > > > >
> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me
> ~30
> > > Gbps
> > > > > > > > > > performance
> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me
> > > ~35-36
> > > > > Gbps
> > > > > > > > > > > performance
> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this gives
> me
> > > ~39
> > > > > Gbps
> > > > > > > > > > > performance
> > > > > > > > > > >
> > > > > > > > > > > I think the order makes sense. To better understand the
> > > > > > > performance of
> > > > > > > > > > > Arrow/Java we can focus on the option 3.
> > > > > > > > > > >
> > > > > > > > > > > The key question I am trying to answer is "what would
> it
> > > take
> > > > > for
> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it
> > > > > possible? If
> > > > > > > > > yes,
> > > > > > > > > > > then what is missing/or mis-interpreted by me? If not,
> then
> > > > > where
> > > > > > > is
> > > > > > > > > the
> > > > > > > > > > > performance lost? Does anyone have any performance
> > > measurements
> > > > > > > for C++
> > > > > > > > > > > implementation? if they have seen/expect better
> numbers.
> > > > > > > > > > >
> > > > > > > > > > > As a next step, I will profile the read path of
> Arrow/Java
> > > for
> > > > > the
> > > > > > > > > option
> > > > > > > > > > > 3. I will report my findings.
> > > > > > > > > > >
> > > > > > > > > > > Any thoughts and feedback on this investigation are
> very
> > > > > welcome.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > --
> > > > > > > > > > > Animesh
> > > > > > > > > > >
> > > > > > > > > > > PS~ Cross-posting on the d...@crail.apache.org list as
> > > well.
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
>
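Johan's closing question in the message above — can plain Java over primitive arrays reach 100 Gbps in the first place? — can be probed with a minimal baseline like the following sketch (class name, array size, and the single warm-up pass are illustrative choices, not taken from the benchmark repository):

```java
public class ArrayBaseline {
    // Sum a primitive float[]; this is the upper bound that any Arrow/Java
    // consumer of a float column has to compete with.
    static double sum(float[] data) {
        double checksum = 0;
        for (float v : data) {
            checksum += v;
        }
        return checksum;
    }

    public static void main(String[] args) {
        int n = 1 << 24; // 16M floats = 64 MB
        float[] data = new float[n];
        java.util.Arrays.fill(data, 1.0f);

        sum(data); // warm up the JIT once before timing
        long start = System.nanoTime();
        double checksum = sum(data);
        long elapsedNs = System.nanoTime() - start;

        // bytes * 8 bits / nanoseconds == gigabits per second
        double gbps = (n * 4L * 8.0) / elapsedNs;
        System.out.printf("checksum=%.0f bandwidth=%.1f Gbps%n", checksum, gbps);
    }
}
```

A single-pass result from a loop like this gives the "primitive arrays in Java" number that the off-heap Arrow read path can then be compared against.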


Re: [JAVA] Arrow performance measurement

2018-10-11 Thread Wes McKinney
 > > > > project) delivering close to the hardware speed reading/writing
> > > > rates. As
> > > > > > > Wes pointed out that a C++ library implementation should be
> > memory-IO
> > > > > > > bound, so what would it take to deliver the same performance in
> > Java
> > > > ;)
> > > > > > > (and then, from across the network).
> > > > > > >
> > > > > > > I hope this makes sense.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > --
> > > > > > > Animesh
> > > > > > >
> > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <
> > jacq...@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > My big question is what is the use case and how/what are you
> > > > trying to
> > > > > > > > compare? Arrow's implementation is more focused on interacting
> > > > with the
> > > > > > > > structure than transporting it. Generally speaking, when we're
> > > > working
> > > > > > with
> > > > > > > > Arrow data we frequently are just interacting with memory
> > > > locations and
> > > > > > > > doing direct operations. If you have a layer that supports that
> > > > type of
> > > > > > > > semantic, create a movement technique that depends on that.
> > Arrow
> > > > > > doesn't
> > > > > > > > force a particular API since the data itself is defined by its
> > > > > > in-memory
> > > > > > > > layout so if you have a custom use or pattern, just work with
> > the
> > > > > > in-memory
> > > > > > > > structures.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <
> > wesmck...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > hi Animesh,
> > > > > > > > >
> > > > > > > > > Per Johan's comments, the C++ library is essentially going
> > to be
> > > > > > > > > IO/memory bandwidth bound since you're interacting with raw
> > > > pointers.
> > > > > > > > >
> > > > > > > > > I'm looking at your code
> > > > > > > > >
> > > > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > > > >   Float4Vector accessor = (Float4Vector) fv;
> > > > > > > > >   int valCount = accessor.getValueCount();
> > > > > > > > >   for (int i = 0; i < valCount; i++) {
> > > > > > > > >     if (!accessor.isNull(i)) {
> > > > > > > > >       float4Count += 1;
> > > > > > > > >       checksum += accessor.get(i);
> > > > > > > > >     }
> > > > > > > > >   }
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise
> > you
> > > > the
> > > > > > > > > fastest way to iterate over this data -- my understanding is
> > that
> > > > > > much
> > > > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf
> > objects
> > > > > > > > > rather than going through the higher level APIs. You're also
> > > > dropping
> > > > > > > > > performance because memory mapping is not yet implemented in
> > > > Java,
> > > > > > see
> > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > > > >
> > > > > > > > > Furthermore, the IPC reader class you are using could be made
> > > > more
> > > > > > > > > efficient. I described the problem in
> > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this
> > will be
> > > > > > > > > required as s

Re: [JAVA] Arrow performance measurement

2018-10-04 Thread Wes McKinney
at is the use case and how/what are you
> > trying to
> > > > > > compare? Arrow's implementation is more focused on interacting
> > with the
> > > > > > structure than transporting it. Generally speaking, when we're
> > working
> > > > with
> > > > > > Arrow data we frequently are just interacting with memory
> > locations and
> > > > > > doing direct operations. If you have a layer that supports that
> > type of
> > > > > > semantic, create a movement technique that depends on that. Arrow
> > > > doesn't
> > > > > > force a particular API since the data itself is defined by its
> > > > in-memory
> > > > > > layout so if you have a custom use or pattern, just work with the
> > > > in-memory
> > > > > > structures.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney 
> > > > wrote:
> > > > > >
> > > > > > > hi Animesh,
> > > > > > >
> > > > > > > Per Johan's comments, the C++ library is essentially going to be
> > > > > > > IO/memory bandwidth bound since you're interacting with raw
> > pointers.
> > > > > > >
> > > > > > > I'm looking at your code
> > > > > > >
> > > > > > > private void consumeFloat4(FieldVector fv) {
> > > > > > >   Float4Vector accessor = (Float4Vector) fv;
> > > > > > >   int valCount = accessor.getValueCount();
> > > > > > >   for (int i = 0; i < valCount; i++) {
> > > > > > >     if (!accessor.isNull(i)) {
> > > > > > >       float4Count += 1;
> > > > > > >       checksum += accessor.get(i);
> > > > > > >     }
> > > > > > >   }
> > > > > > > }
> > > > > > >
> > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise you
> > the
> > > > > > > fastest way to iterate over this data -- my understanding is that
> > > > much
> > > > > > > code in Dremio interacts with the wrapped Netty ArrowBuf objects
> > > > > > > rather than going through the higher level APIs. You're also
> > dropping
> > > > > > > performance because memory mapping is not yet implemented in
> > Java,
> > > > see
> > > > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > > > >
> > > > > > > Furthermore, the IPC reader class you are using could be made
> > more
> > > > > > > efficient. I described the problem in
> > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > > > > > > required as soon as we have the ability to do memory mapping in
> > Java
> > > > > > >
> > > > > > > Could Crail use the Arrow data structures in its runtime rather
> > than
> > > > > > > copying? If not, how are Crail's runtime data structures
> > different?
> > > > > > >
> > > > > > > - Wes
> > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Hello Animesh,
> > > > > > > >
> > > > > > > >
> > > > > > > > I browsed a bit in your sources, thanks for sharing. We have
> > > > performed
> > > > > > > some similar measurements to your third case in the past for
> > C/C++ on
> > > > > > > collections of various basic types such as primitives and
> > strings.
> > > > > > > >
> > > > > > > >
> > > > > > > > I can say that in terms of consuming data from the Arrow format
> > > > versus
> > > > > > > language native collections in C++, I've so far always been able
> > to
> > > > get
> > > > > > > similar performance numbers (e.g. no drawbacks due to the Arrow
> > > > format
> > > > > > > itself). Especially when accessing the data through Arrow's raw
> > data
> > > >

Re: [JAVA] Arrow performance measurement

2018-10-04 Thread Wes McKinney
e because memory mapping is not yet implemented in Java,
> > see
> > > > > https://issues.apache.org/jira/browse/ARROW-3191.
> > > > >
> > > > > Furthermore, the IPC reader class you are using could be made more
> > > > > efficient. I described the problem in
> > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > > > > required as soon as we have the ability to do memory mapping in Java
> > > > >
> > > > > Could Crail use the Arrow data structures in its runtime rather than
> > > > > copying? If not, how are Crail's runtime data structures different?
> > > > >
> > > > > - Wes
> > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > > > >  wrote:
> > > > > >
> > > > > > Hello Animesh,
> > > > > >
> > > > > >
> > > > > > I browsed a bit in your sources, thanks for sharing. We have
> > performed
> > > > > some similar measurements to your third case in the past for C/C++ on
> > > > > collections of various basic types such as primitives and strings.
> > > > > >
> > > > > >
> > > > > > I can say that in terms of consuming data from the Arrow format
> > versus
> > > > > language native collections in C++, I've so far always been able to
> > get
> > > > > similar performance numbers (e.g. no drawbacks due to the Arrow
> > format
> > > > > itself). Especially when accessing the data through Arrow's raw data
> > > > > pointers (and using for example std::string_view-like constructs).
> > > > > >
> > > > > > In C/C++ the fast data structures are engineered so that as few
> > > > > > pointer traversals as possible are required and the memory
> > > > > > footprint is as small as possible. Thus each memory access is
> > > > > > relatively efficient (in terms of obtaining the data of interest).
> > > > > > The same can absolutely be said for Arrow, which is, if anything,
> > > > > > even more efficient in some cases where object fields are of
> > > > > > variable length.
> > > > > >
> > > > > >
> > > > > > In the JVM case, the Arrow data is stored off-heap. This requires
> > the
> > > > > JVM to interface to it through some calls to Unsafe hidden under the
> > > > Netty
> > > > > layer (but please correct me if I'm wrong, I'm not an expert on
> > this).
> > > > > Those calls are the only reason I can think of that would degrade the
> > > > > performance a bit compared to a pure Java case. I don't know if the
> > > > Unsafe
> > > > > calls are inlined during JIT compilation. If they aren't they will
> > > > increase
> > > > > access latency to any data a little bit.
> > > > > >
> > > > > >
> > > > > > I don't have a similar machine so it's not easy to relate my
> > numbers to
> > > > > yours, but if you can get that data consumed with 100 Gbps in a pure
> > Java
> > > > > case, I don't see any reason (resulting from Arrow format / off-heap
> > > > > storage) why you wouldn't be able to get at least really close. Can
> > you
> > > > get
> > > > > to 100 Gbps starting from primitive arrays in Java with your
> > consumption
> > > > > functions in the first place?
> > > > > >
> > > > > >
> > > > > > I'm interested to see your progress on this.
> > > > > >
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > >
> > > > > > Johan Peltenburg
> > > > > >
> > > > > > 
> > > > > > From: Animesh Trivedi 
> > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM
> > > > > > To: dev@arrow.apache.org; d...@crail.apache.org
> > > > > > Subject: [JAVA] Arrow performance measurement
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > A week ago, Wes and I had a discussion about the performance of the
> > > > > > Arrow/Java implementation on the Apache Crail (Incubating) mailing
>
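The memory mapping that ARROW-3191 tracks corresponds to standard java.nio functionality. Below is a hedged sketch of what a zero-copy read path looks like in plain NIO — this is NOT an Arrow API, and the class name is hypothetical; it only shows the OS-level mechanism that Arrow/Java lacked at the time:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    // Map a file read-only into the process address space. Pages are
    // faulted in on demand; nothing is copied into a heap byte[].
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        MappedByteBuffer buf = map(Path.of(args[0]));
        double checksum = 0;
        while (buf.remaining() >= 4) {
            checksum += buf.getFloat(); // read floats straight off the mapping
        }
        System.out.println("checksum=" + checksum);
    }
}
```

Without this, every read goes through an extra copy from the file into a heap buffer, which is part of the gap between the ~30-39 Gbps numbers measured here and raw memory bandwidth.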

Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Wes McKinney
orce a particular API since the data itself is defined by its in-memory
> > layout so if you have a custom use or pattern, just work with the in-memory
> > structures.
> >
> >
> >
> > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney  wrote:
> >
> > > hi Animesh,
> > >
> > > Per Johan's comments, the C++ library is essentially going to be
> > > IO/memory bandwidth bound since you're interacting with raw pointers.
> > >
> > > I'm looking at your code
> > >
> > > private void consumeFloat4(FieldVector fv) {
> > >   Float4Vector accessor = (Float4Vector) fv;
> > >   int valCount = accessor.getValueCount();
> > >   for (int i = 0; i < valCount; i++) {
> > >     if (!accessor.isNull(i)) {
> > >       float4Count += 1;
> > >       checksum += accessor.get(i);
> > >     }
> > >   }
> > > }
> > >
> > > You'll want to get a Java-Arrow expert from Dremio to advise you the
> > > fastest way to iterate over this data -- my understanding is that much
> > > code in Dremio interacts with the wrapped Netty ArrowBuf objects
> > > rather than going through the higher level APIs. You're also dropping
> > > performance because memory mapping is not yet implemented in Java, see
> > > https://issues.apache.org/jira/browse/ARROW-3191.
> > >
> > > Furthermore, the IPC reader class you are using could be made more
> > > efficient. I described the problem in
> > > https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
> > > required as soon as we have the ability to do memory mapping in Java
> > >
> > > Could Crail use the Arrow data structures in its runtime rather than
> > > copying? If not, how are Crail's runtime data structures different?
> > >
> > > - Wes
> > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI
> > >  wrote:
> > > >
> > > > Hello Animesh,
> > > >
> > > >
> > > > I browsed a bit in your sources, thanks for sharing. We have performed
> > > some similar measurements to your third case in the past for C/C++ on
> > > collections of various basic types such as primitives and strings.
> > > >
> > > >
> > > > I can say that in terms of consuming data from the Arrow format versus
> > > language native collections in C++, I've so far always been able to get
> > > similar performance numbers (e.g. no drawbacks due to the Arrow format
> > > itself). Especially when accessing the data through Arrow's raw data
> > > pointers (and using for example std::string_view-like constructs).
> > > >
> > > > In C/C++ the fast data structures are engineered in such a way that as
> > > little pointer traversals are required and they take up an as small as
> > > possible memory footprint. Thus each memory access is relatively
> > efficient
> > > (in terms of obtaining the data of interest). The same can absolutely be
> > > said for Arrow, if not even more efficient in some cases where object
> > > fields are of variable length.
> > > >
> > > >
> > > > In the JVM case, the Arrow data is stored off-heap. This requires the
> > > JVM to interface to it through some calls to Unsafe hidden under the
> > Netty
> > > layer (but please correct me if I'm wrong, I'm not an expert on this).
> > > Those calls are the only reason I can think of that would degrade the
> > > performance a bit compared to a pure JAva case. I don't know if the
> > Unsafe
> > > calls are inlined during JIT compilation. If they aren't they will
> > increase
> > > access latency to any data a little bit.
> > > >
> > > >
> > > > I don't have a similar machine so it's not easy to relate my numbers to
> > > yours, but if you can get that data consumed with 100 Gbps in a pure Java
> > > case, I don't see any reason (resulting from Arrow format / off-heap
> > > storage) why you wouldn't be able to get at least really close. Can you
> > get
> > > to 100 Gbps starting from primitive arrays in Java with your consumption
> > > functions in the first place?
> > > >
> > > >
> > > > I'm interested to see your progress on this.
> > > >
> > > >
> > > > Kind regards,
> > > >
> > > >
> > > > Johan Peltenburg
> > > >
> > > > 
&g

Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Animesh Trivedi

Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Jacques Nadeau
My big question is what is the use case and how/what are you trying to
compare? Arrow's implementation is more focused on interacting with the
structure than transporting it. Generally speaking, when we're working with
Arrow data we frequently are just interacting with memory locations and
doing direct operations. If you have a layer that supports that type of
semantic, create a movement technique that depends on that. Arrow doesn't
force a particular API since the data itself is defined by its in-memory
layout so if you have a custom use or pattern, just work with the in-memory
structures.




Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Wes McKinney
hi Animesh,

Per Johan's comments, the C++ library is essentially going to be
IO/memory bandwidth bound since you're interacting with raw pointers.

I'm looking at your code

private void consumeFloat4(FieldVector fv) {
    Float4Vector accessor = (Float4Vector) fv;
    int valCount = accessor.getValueCount();
    for (int i = 0; i < valCount; i++) {
        if (!accessor.isNull(i)) {
            float4Count += 1;
            checksum += accessor.get(i);
        }
    }
}

You'll want to get a Java-Arrow expert from Dremio to advise you the
fastest way to iterate over this data -- my understanding is that much
code in Dremio interacts with the wrapped Netty ArrowBuf objects
rather than going through the higher level APIs. You're also dropping
performance because memory mapping is not yet implemented in Java, see
https://issues.apache.org/jira/browse/ARROW-3191.
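For concreteness, here is a rough sketch of what scanning via the raw buffers
rather than the accessor API could look like. Plain ByteBuffers stand in for
Arrow's off-heap ArrowBufs (in real code they would come from the vector's
data and validity buffers); treat the layout details and names here as my
assumptions, not vetted Arrow Java usage:

```java
import java.nio.ByteBuffer;

// Illustrative only: sum the non-null float4 slots by reading the validity
// bitmap and the data buffer directly, instead of calling isNull()/get()
// per element. Arrow packs validity as one bit per slot (byte i >> 3,
// bit i & 7), and float4 data at a 4-byte stride.
final class Float4Scan {
    static double checksum(ByteBuffer validity, ByteBuffer data, int valueCount) {
        double sum = 0;
        for (int i = 0; i < valueCount; i++) {
            int bits = validity.get(i >> 3) & 0xFF;
            if (((bits >> (i & 7)) & 1) != 0) {
                sum += data.getFloat(i << 2); // 4-byte stride
            }
        }
        return sum;
    }
}
```

Whether this beats the accessor API in practice depends on what the JIT
already does with isNull()/get() -- exactly the kind of thing a Dremio
expert could confirm.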

Furthermore, the IPC reader class you are using could be made more
efficient. I described the problem in
https://issues.apache.org/jira/browse/ARROW-3192 -- this will be
required as soon as we have the ability to do memory mapping in Java.

Could Crail use the Arrow data structures in its runtime rather than
copying? If not, how are Crail's runtime data structures different?

- Wes

Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Johan Peltenburg - EWI
Hello Animesh,


I browsed a bit in your sources, thanks for sharing. We have performed some 
similar measurements to your third case in the past for C/C++ on collections of 
various basic types such as primitives and strings.


I can say that in terms of consuming data from the Arrow format versus language 
native collections in C++, I've so far always been able to get similar 
performance numbers (e.g. no drawbacks due to the Arrow format itself). 
Especially when accessing the data through Arrow's raw data pointers (and using 
for example std::string_view-like constructs).

In C/C++ the fast data structures are engineered so that as few pointer
traversals as possible are required and they take up as small a memory
footprint as possible. Thus each memory access is relatively efficient (in
terms of obtaining the data of interest). The same can absolutely be said for
Arrow, if not even more so in some cases where object fields are of variable
length.


In the JVM case, the Arrow data is stored off-heap. This requires the JVM to
interface with it through some calls to Unsafe hidden under the Netty layer (but
please correct me if I'm wrong, I'm not an expert on this). Those calls are the
only reason I can think of that would degrade performance a bit compared to a
pure Java case. I don't know if the Unsafe calls are inlined during JIT
compilation; if they aren't, they will increase access latency to any data a
little bit.
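A quick way to sanity-check that cost is a toy loop over a direct (off-heap)
ByteBuffer versus a plain long[]. This is only a sketch under the stated
assumption that direct-buffer reads exercise the same Unsafe-backed path as
Arrow's buffers; a real measurement should use JMH, and the class and method
names here are made up:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Toy probe of the two access paths: on-heap long[] vs off-heap direct
// ByteBuffer. Not a rigorous benchmark -- no warmup, single run.
final class OffHeapProbe {
    static long sumArray(long[] a) {
        long s = 0;
        for (long v : a) s += v;
        return s;
    }

    static long sumDirect(ByteBuffer buf, int count) {
        long s = 0;
        for (int i = 0; i < count; i++) s += buf.getLong(i << 3); // 8-byte stride
        return s;
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        long[] onHeap = new long[n];
        ByteBuffer offHeap = ByteBuffer.allocateDirect(n * 8).order(ByteOrder.nativeOrder());
        for (int i = 0; i < n; i++) {
            onHeap[i] = i;
            offHeap.putLong(i << 3, i);
        }
        long t0 = System.nanoTime();
        long s1 = sumArray(onHeap);
        long t1 = System.nanoTime();
        long s2 = sumDirect(offHeap, n);
        long t2 = System.nanoTime();
        System.out.println("on-heap:  " + (t1 - t0) + " ns (sum=" + s1 + ")");
        System.out.println("off-heap: " + (t2 - t1) + " ns (sum=" + s2 + ")");
    }
}
```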


I don't have a similar machine, so it's not easy to relate my numbers to yours,
but if you can get that data consumed at 100 Gbps in a pure Java case, I don't
see any reason (resulting from the Arrow format / off-heap storage) why you
wouldn't be able to get at least really close. Can you get to 100 Gbps starting
from primitive arrays in Java with your consumption functions in the first
place?


I'm interested to see your progress on this.


Kind regards,


Johan Peltenburg


From: Animesh Trivedi 
Sent: Wednesday, September 19, 2018 2:08:50 PM
To: dev@arrow.apache.org; d...@crail.apache.org
Subject: [JAVA] Arrow performance measurement



[JAVA] Arrow performance measurement

2018-09-19 Thread Animesh Trivedi
Hi all,

A week ago, Wes and I had a discussion about the performance of the
Arrow/Java implementation on the Apache Crail (Incubating) mailing list (
http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser). In
a nutshell: I am investigating the performance of various file formats
(including Arrow) on high-performance NVMe and RDMA/100Gbps/RoCE setups. I
benchmarked how long it takes to materialize values (ints, longs,
doubles) of the store_sales table, the largest table in the TPC-DS dataset,
stored in different file formats. Here is a write-up on this -
https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I found that
between a pair of machines connected over a 100 Gbps link, Arrow (used as a
file format on HDFS) delivered close to ~30 Gbps of bandwidth with all 16
cores engaged. Wes pointed out that (i) Arrow is an in-memory IPC format and
has not been optimized for storage interfaces/APIs like HDFS; (ii) the
performance I am measuring is for the Java implementation.

Wes, I hope I summarized our discussion correctly.

That brings us to this email where I promised to follow up on the Arrow
mailing list to understand and optimize the performance of Arrow/Java
implementation on high-performance devices. I wrote a small stand-alone
benchmark (https://github.com/animeshtrivedi/benchmarking-arrow) with three
implementations of WritableByteChannel, SeekableByteChannel interfaces:

1. Arrow data is stored in HDFS/tmpfs - this gives me ~30 Gbps performance
2. Arrow data is stored in Crail/DRAM - this gives me ~35-36 Gbps
performance
3. Arrow data is stored in on-heap byte[] - this gives me ~39 Gbps
performance

I think the order makes sense. To better understand the performance of
Arrow/Java, we can focus on option 3.
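For option 3, the channel can be as simple as a SeekableByteChannel backed by
an on-heap byte[]. A minimal sketch follows; the class name and structure are
illustrative, not taken from the benchmark repository:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

// Serve Arrow IPC bytes from an on-heap byte[] through a read-only
// SeekableByteChannel, so the reader path is measured without any storage
// or network stack underneath.
class ByteArrayChannel implements SeekableByteChannel {
    private final byte[] data;
    private long position = 0;
    private boolean open = true;

    ByteArrayChannel(byte[] data) { this.data = data; }

    @Override public int read(ByteBuffer dst) {
        if (position >= data.length) return -1; // EOF
        int n = (int) Math.min(dst.remaining(), data.length - position);
        dst.put(data, (int) position, n);
        position += n;
        return n;
    }
    @Override public int write(ByteBuffer src) throws IOException {
        throw new IOException("read-only channel");
    }
    @Override public long position() { return position; }
    @Override public SeekableByteChannel position(long newPosition) {
        position = newPosition;
        return this;
    }
    @Override public long size() { return data.length; }
    @Override public SeekableByteChannel truncate(long size) throws IOException {
        throw new IOException("read-only channel");
    }
    @Override public boolean isOpen() { return open; }
    @Override public void close() { open = false; }
}
```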

The key question I am trying to answer is "what would it take for
Arrow/Java to deliver 100+ Gbps of performance"? Is it possible? If yes,
what is missing or misinterpreted by me? If not, where is the performance
lost? Does anyone have performance measurements for the C++ implementation,
and have they seen or do they expect better numbers?

As a next step, I will profile the read path of Arrow/Java for option 3
and report my findings.

Any thoughts and feedback on this investigation are very welcome.

Cheers,
--
Animesh

PS~ Cross-posting on the d...@crail.apache.org list as well.