Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?

2021-09-07 Thread QP Hou
A minor note on the Rust side of things. arrow-rs has a 2-week
release cycle, but arrow-datafusion mostly releases on demand at
the moment. Our most up-to-date release processes are documented at [1]
and [2].

[1]: https://github.com/apache/arrow-rs/blob/master/dev/release/README.md
[2]: 
https://github.com/apache/arrow-datafusion/blob/master/dev/release/README.md

On Tue, Sep 7, 2021 at 4:01 PM Jacob Quinn  wrote:
>
> Thanks kou.
>
> I think the TODO action list looks good.
>
> The one point I think could use some additional discussion is around the
> release cadence: it IS desirable to be able to release more frequently than
> the parent repo's 3-4 month cadence. But we also haven't had the frequency of
> commits to warrant a release every 2 weeks. I can think of two
> possible options, not sure if one or the other would be more compatible
> with the apache release process:
>
> 1) Allow for release-on-demand; this is idiomatic for most Julia packages
> I'm aware of. When a particular bug is fixed, or feature added, a user can
> request a release, a little discussion happens, and a new release is made.
> This approach would work well for the "bursty" kind of contributions we've
> seen to Arrow.jl where development by certain people will happen frequently
> for a while, then take a break for other things. This also avoids having
> "scheduled" releases (every 2 weeks, 3 months, etc.) where there hasn't
> been significant updates to necessarily warrant a new release. This
> approach may also facilitate differentiating between bugfix (patch)
> releases vs. new functionality releases (minor), since when a release is
> requested, it could be specified whether it should be patch or minor (or
> major).
>
> 2) Commit to a scheduled release pattern like every 2 weeks, once a month,
> etc. This has the advantage of consistency and clearer expectations for
> users/devs involved. A release also doesn't need to be requested, because
> we can just wait for the scheduled time to release. In terms of the
> "unnecessary releases" mentioned above, it could be as simple as
> "cancelling" a release if there hasn't been significant updates in the
> elapsed time period.
>
> My preference would be for 1), but that's influenced from what I'm familiar
> with in the Julia package ecosystem. It seems like it would still fit in
> the apache way since we would formally request a new release, wait the
> elapsed amount of time for voting (24 hours would be preferable), then at
> the end of the voting period, a new release could be made.
>
> Thanks again kou for helping support the Julia implementation here.
>
> -Jacob
>
>
> On Sun, Sep 5, 2021 at 3:25 PM Sutou Kouhei  wrote:
>
> > Hi,
> >
> > Sorry for the delay. This is a continuation of the "Status
> > of Arrow Julia implementation?" thread:
> >
> >
> > https://lists.apache.org/x/thread.html/r6d91286686d92837fbe21dd042801a57e3a7b00b5903ea90a754ac7b%40%3Cdev.arrow.apache.org%3E
> >
> > I summarize the current status, the next actions and items
> > to be discussed.
> >
> > The current status:
> >
> >   * The Julia Arrow implementation uses
> > https://github.com/JuliaData/Arrow.jl as a "dev branch"
> > instead of creating a branch in
> > https://github.com/apache/arrow
> >   * The Julia Arrow implementation wants to use GitHub
> > for the main issue management platform
> >   * The Julia Arrow implementation wants to release
> > more frequently than one release per 3-4 months
> >   * The current workflow of the Rust Arrow implementation
> > will also fit the Julia Arrow implementation
> >
> > The current workflow of the Rust Arrow implementation:
> >
> >
> > https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit#heading=h.kv1hwbhi3cmi
> >
> > * Uses apache/arrow-rs and apache/arrow-datafusion instead
> >   of apache/arrow for repository
> >
> > * Uses GitHub instead of JIRA for issue management
> >   platform
> >
> >
> > https://docs.google.com/document/d/1tMQ67iu8XyGGZuj--h9WQYB9inCk6c2sL_4xMTwENGc/edit
> >
> > * Releases a new minor and patch version every 2 weeks
> >   in addition to the quarterly release of the other releases
> >
> > The next actions after we get a consensus about this
> > discussion:
> >
> >   1. Start a vote on moving the Julia Arrow implementation like
> >  the Rust one:
> >
> >
> > https://lists.apache.org/x/thread.html/r44390a18b3fbb08ddb68aa4d12f37245d948984fae11a41494e5fc1d@%3Cdev.arrow.apache.org%3E
> >
> >   2. Create apache/arrow-julia
> >
> >   3. Start IP clearance process to import JuliaData/Arrow.jl
> >  to apache/arrow-julia
> >
> >  (We don't use julia/Arrow/ in apache/arrow.)
> >
> >   4. Import JuliaData/Arrow.jl to apache/arrow-julia
> >
> >   5. Prepare integration tests CI in apache/arrow-julia and apache/arrow
> >
> >   6. Prepare releasing tools in apache/arrow-julia and apache/arrow
> >
> >   7. Remove julia/... from apache/arrow and leave
> 

Re: [Question] Allocations along 64 byte cache lines

2021-09-07 Thread Yibo Cai

Thanks Jorge,

I'm wondering if the 64-byte alignment requirement is for the cache or for
SIMD registers (AVX-512?).


For SIMD, it looks like register-width alignment does help.
E.g., _mm_load_si128 can only load 128-bit aligned data; it performs
better than _mm_loadu_si128, which supports unaligned loads.


Again, be very skeptical of the benchmark :)
https://quick-bench.com/q/NxyDu89azmKJmiVxF29Ei8FybWk


On 9/7/21 7:16 PM, Jorge Cardoso Leitão wrote:

Thanks,

I think that the alignment requirement in IPC is different from this one:
we enforce 8/64 byte alignment when serializing for IPC, but we (only)
recommend 64 byte alignment in memory addresses (at least this is my
understanding from the above link).

I did test adding two arrays and the result is independent of the alignment
(on my machine, compiler, etc).

Yibo, thanks a lot for that example. I am unsure whether it captures the
cache alignment concept, though: in the example we are reading a long (8
bytes) from a pointer that is not aligned with 8 bytes (63 % 8 != 0), which
is both slow and often undefined behavior. I think that the bench we want
is to change 63 to 64-8 (which is still not 64-bytes cache aligned but
aligned with a long), the difference vanishes (under the same gotchas that
you mentioned) https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
Alternatively, add an int32 with an offset of 4.
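The 63 vs 64-8 distinction above can also be checked without intrinsics; a small numpy sketch (numpy records whether an array's data pointer is aligned for its dtype):

```python
import numpy as np

raw = bytes(1024)  # backing buffer; CPython keeps its data at least 8-byte aligned

# Offset 63: 63 % 8 != 0, so an int64 view is misaligned for its own type.
misaligned = np.frombuffer(raw, dtype=np.int64, offset=63, count=8)
# Offset 56 (= 64 - 8): long-aligned, though not 64-byte cache-line aligned.
aligned = np.frombuffer(raw, dtype=np.int64, offset=56, count=8)

print(misaligned.flags['ALIGNED'], aligned.flags['ALIGNED'])
```

numpy tolerates the misaligned view (it falls back to safe access paths), which is the well-defined analogue of the unaligned load being benchmarked.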

I benched both with explicit (via intrinsics) SIMD and without (i.e. let
the compiler do it for us), and the alignment does not impact the benches.

Best,
Jorge

[1] https://stackoverflow.com/a/27184001/931303





On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai  wrote:


Did a quick bench of accessing a long buffer not 8-byte aligned. Given
enough conditions, it looks like unaligned access does show some penalty
over aligned access. But I don't think this is an issue in practice.

Please be very skeptical of this benchmark. It's hard to get it right
given the complexity of hardware, compiler, benchmark tool and env.

https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk


On 9/7/21 7:55 AM, Micah Kornfield wrote:


My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.



This is probably true.  [1] is the original mailing list discussion.  I
think lack of measurable differences and high overhead for 64 byte
alignment was the reason for relaxing to 8 byte alignment.

Specifically, I performed two types of tests, a "random sum" where we
compute the sum of the values taken at random indices, and "sum", where we
sum all values of the array (buffer[1] of the primitive array), both for
arrays ranging from 2^10 to 2^25 elements. I was expecting that, at least in
the latter, prefetching would help, but I do not observe any difference.



The most likely place I think where this could make a difference would be
for operations on wider types (Decimal128 and Decimal256). Another place
where I think alignment could help is when adding two primitive arrays (it
sounds like this was summing a single array?).

[1]


https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E


On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou 

wrote:




Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :

Thanks a lot Antoine for the pointers. Much appreciated!

Generally, it should not hurt to align allocations to 64 bytes anyway,
since you are generally dealing with large enough data that the
(small) memory overhead doesn't matter.


Not for performance. However, 64 byte alignment in Rust requires
maintaining a custom container, a custom allocator, and the inability to
interoperate with `std::Vec` and the ecosystem that is based on it, since
std::Vec allocates with alignment T (e.g. int32), not 64 bytes. For anyone
interested, the background for this is this old PR [1] and in arrow2 [2].


I see. In the C++ implementation, we are not compatible with the default
allocator either (but C++ allocators as defined by the standard library
don't support resizing, which doesn't make them terribly useful for
Arrow anyway).


Neither myself in micro benches nor Ritchie from polars (query engine) in
large scale benches observe any difference in the archs we have available.
This is not consistent with the emphasis we put on the memory alignments
discussion [3], and I am trying to understand the root cause for this
inconsistency.


My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.


By prefetching I mean implicit; no intrinsics involved.


Well, I'm not aware that implicit prefetching depends on alignment.

Regards

Antoine.









Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-09-07 Thread Jacques Nadeau
As Phillip mentioned, I think there is something powerful in producing a
standard serialized representation of compute operations beyond just Arrow
and I'd really like to create a broader community around it. This has been
something I had been independently thinking about for the last several
months. The discussion here has inspired me to start making real progress
on this work. As such, I created a new repository and site where I've
started to put together work around a new specification for compute. I
would love for the people here to help define this and will be looking to a
number of other communities to also contribute. One of my goals has been to
break the specification into a number of much smaller pieces [1] so that we
can make progress on each subsection without being overwhelmed by the
amount of content that must be reviewed.

Would love to hear people's ideas on this initiative.

The site is here: https://substrait.io/
The repo is here: https://github.com/substrait-io/substrait

[1] https://substrait.io/spec/specification/#components



On Wed, Sep 1, 2021 at 3:26 PM Phillip Cloud  wrote:

> Hey everyone,
>
> As many of you know, the compute IR project has a lot of interested parties
> and has generated a lot of feedback. In light of some of the feedback we’ve
> received, we want to stress that the specification is intended to have
> input from many diverse points of view and that we welcome folks outside of
> the Arrow community. We think there’s immense potential for a compute IR
> that multiple projects--including those outside of the Arrow umbrella--can
> leverage.
>
> With that in mind, Jacques has been working on something outside of the
> Arrow repo that’ll be shared in a few days, that is designed to bring those
> viewpoints to bear on the problem of generic relational computation that
> lives outside of the Arrow project.
>
> Inside Arrow, we think that a version of the in-development IR
> specifications from the last several weeks will add a ton of value by
> informing this new effort and would like to continue to move forward with a
> work-in-progress IR inside of Arrow for the time being to enable some work
> on API development (independent of exactly how things are serialized) to
> take place. It is very likely that we will adopt this broader specification
> once the dust has settled, so the format inside of Arrow will be relatively
> unstable for a while and not have backwards compatibility guarantees for
> now.
>
> The primary focus of the Arrow IR will be on shoring up APIs (producers and
> consumers), and we will also be moving the compute IR flatbuffers files out
> the format directory into another top-level directory in the repo.
>
> Thanks,
> Phillip
>
> On Mon, Aug 30, 2021 at 7:30 PM Weston Pace  wrote:
>
> > My (incredibly naive) interpretation is that there are three problems to
> > tackle.
> >
> > 1) How do you represent a graph and relational operators (join, union,
> > groupby, etc.)
> >  - The PR appears to be addressing this question fairly well
> > 2) How does a frontend query a backend to know what UDFs are supported.
> >  - I don't see anything in the spec for this (some comments touch on
> > it) but it seems like it would be necessary to build any kind of
> > system.
> > 3) Is there some well defined set of canonical UDFs that we can all
> > agree on the semantics for (e.g. addition, subtraction, etc.)
> >  - I thought, from earlier comments in this email thread, that the
> > goal was to avoid addressing this.  Although I think there is strong
> > value here as well.
> >
> > So what is the scope of this initiative?  If it is just #1 for example
> > then I don't see any need to put types in the IR (and I've commented
> > as such in the PR).  From a relational perspective isn't a UDF just a
> > black box Table -> UDF -> Table?
> >
> > On Mon, Aug 30, 2021 at 11:10 AM Phillip Cloud 
> wrote:
> > >
> > > Hey everyone,
> > >
> > > There's some interesting discussion around types and where their
> location
> > > is in the current PR [1] (and in fact whether to store them at all).
> > >
> > > It would be great to get some community feedback on this [2] part of
> the
> > PR
> > > in particular, because the choice of whether to store types at all has
> > > important design consequences.
> > >
> > > [1]: https://github.com/apache/arrow/pull/10934
> > > [2]: https://github.com/apache/arrow/pull/10934/files#r697025313
> > >
> > > On Fri, Aug 27, 2021 at 2:11 AM Micah Kornfield  >
> > > wrote:
> > >
> > > > As an FYI, Iceberg is also considering an IR in relation to view
> > support
> > > > [1].  I chimed in and pointed them to this thread and Wes's doc.
> > Phillip
> > > > and Jacques chimed in there as well.
> > > >
> > > > [1]
> > > >
> > > >
> >
> https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E
> > > >
> > > > On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud 
> > wrote:
> > > >
> > 

Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?

2021-09-07 Thread Jacob Quinn
Thanks kou.

I think the TODO action list looks good.

The one point I think could use some additional discussion is around the
release cadence: it IS desirable to be able to release more frequently than
the parent repo's 3-4 month cadence. But we also haven't had the frequency of
commits to warrant a release every 2 weeks. I can think of two
possible options, not sure if one or the other would be more compatible
with the apache release process:

1) Allow for release-on-demand; this is idiomatic for most Julia packages
I'm aware of. When a particular bug is fixed, or feature added, a user can
request a release, a little discussion happens, and a new release is made.
This approach would work well for the "bursty" kind of contributions we've
seen to Arrow.jl where development by certain people will happen frequently
for a while, then take a break for other things. This also avoids having
"scheduled" releases (every 2 weeks, 3 months, etc.) where there hasn't
been significant updates to necessarily warrant a new release. This
approach may also facilitate differentiating between bugfix (patch)
releases vs. new functionality releases (minor), since when a release is
requested, it could be specified whether it should be patch or minor (or
major).

2) Commit to a scheduled release pattern like every 2 weeks, once a month,
etc. This has the advantage of consistency and clearer expectations for
users/devs involved. A release also doesn't need to be requested, because
we can just wait for the scheduled time to release. In terms of the
"unnecessary releases" mentioned above, it could be as simple as
"cancelling" a release if there hasn't been significant updates in the
elapsed time period.

My preference would be for 1), but that's influenced from what I'm familiar
with in the Julia package ecosystem. It seems like it would still fit in
the apache way since we would formally request a new release, wait the
elapsed amount of time for voting (24 hours would be preferable), then at
the end of the voting period, a new release could be made.

Thanks again kou for helping support the Julia implementation here.

-Jacob

On Sun, Sep 5, 2021 at 3:25 PM Sutou Kouhei  wrote:

> Hi,
>
> Sorry for the delay. This is a continuation of the "Status
> of Arrow Julia implementation?" thread:
>
>
> https://lists.apache.org/x/thread.html/r6d91286686d92837fbe21dd042801a57e3a7b00b5903ea90a754ac7b%40%3Cdev.arrow.apache.org%3E
>
> I summarize the current status, the next actions and items
> to be discussed.
>
> The current status:
>
>   * The Julia Arrow implementation uses
> https://github.com/JuliaData/Arrow.jl as a "dev branch"
> instead of creating a branch in
> https://github.com/apache/arrow
>   * The Julia Arrow implementation wants to use GitHub
> for the main issue management platform
>   * The Julia Arrow implementation wants to release
> more frequently than one release per 3-4 months
>   * The current workflow of the Rust Arrow implementation
> will also fit the Julia Arrow implementation
>
> The current workflow of the Rust Arrow implementation:
>
>
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit#heading=h.kv1hwbhi3cmi
>
> * Uses apache/arrow-rs and apache/arrow-datafusion instead
>   of apache/arrow for repository
>
> * Uses GitHub instead of JIRA for issue management
>   platform
>
>
> https://docs.google.com/document/d/1tMQ67iu8XyGGZuj--h9WQYB9inCk6c2sL_4xMTwENGc/edit
>
> * Releases a new minor and patch version every 2 weeks
>   in addition to the quarterly release of the other releases
>
> The next actions after we get a consensus about this
> discussion:
>
>   1. Start a vote on moving the Julia Arrow implementation like
>  the Rust one:
>
>
> https://lists.apache.org/x/thread.html/r44390a18b3fbb08ddb68aa4d12f37245d948984fae11a41494e5fc1d@%3Cdev.arrow.apache.org%3E
>
>   2. Create apache/arrow-julia
>
>   3. Start IP clearance process to import JuliaData/Arrow.jl
>  to apache/arrow-julia
>
>  (We don't use julia/Arrow/ in apache/arrow.)
>
>   4. Import JuliaData/Arrow.jl to apache/arrow-julia
>
>   5. Prepare integration tests CI in apache/arrow-julia and apache/arrow
>
>   6. Prepare releasing tools in apache/arrow-julia and apache/arrow
>
>   7. Remove julia/... from apache/arrow and leave
>  julia/README.md pointing to apache/arrow-julia
>
>
> Items to be discussed:
>
>   * Interval of minor and patch releases
>
> * The Rust Arrow implementation uses 2 weeks.
>
> * Does the Julia Arrow implementation also want to use
>   2 weeks?
>
>   * Can we stay in accordance with the Apache way with this workflow
> without pain?
>
> The Rust Arrow implementation workflow includes the
> following for this:
>
>
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit#heading=h.kv1hwbhi3cmi
>
>   > Contributors will be required to write issues for
>   > planned 

Re: HDFS ORC to Arrow Dataset

2021-09-07 Thread Weston Pace
I'll just add that a PR in in progress (thanks Joris!) for adding this
adapter: https://github.com/apache/arrow/pull/10991

On Tue, Sep 7, 2021 at 12:05 PM Wes McKinney  wrote:
>
> I'm missing context but if you're talking about C++/Python, we are
> currently missing a wrapper interface to the ORC reader in the Arrow
> datasets library
>
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/dataset
>
> We have CSV, Arrow (IPC), and Parquet interfaces.
>
> But we have an HDFS filesystem implementation and an ORC reader
> implementation, so mechanically all of the pieces are there but need
> to be connected together.
>
> Thanks,
> Wes
>
> On Tue, Sep 7, 2021 at 8:22 AM Manoj Kumar  wrote:
> >
> > Hi Dev-Community,
> >
> > Can anyone help guide me on how to read ORC directly from HDFS into an
> > Arrow dataset?
> >
> > Thanks
> > Manoj


Re: HDFS ORC to Arrow Dataset

2021-09-07 Thread Wes McKinney
I'm missing context but if you're talking about C++/Python, we are
currently missing a wrapper interface to the ORC reader in the Arrow
datasets library

https://github.com/apache/arrow/tree/master/cpp/src/arrow/dataset

We have CSV, Arrow (IPC), and Parquet interfaces.

But we have an HDFS filesystem implementation and an ORC reader
implementation, so mechanically all of the pieces are there but need
to be connected together.

Thanks,
Wes

On Tue, Sep 7, 2021 at 8:22 AM Manoj Kumar  wrote:
>
> Hi Dev-Community,
>
> Can anyone help guide me on how to read ORC directly from HDFS into an
> Arrow dataset?
>
> Thanks
> Manoj


Re: HTTP traffic of Arrow Flight

2021-09-07 Thread Mohamed Abdelhakem
Yes, I got it. I have to use "Decode As" and choose the HTTP/2 protocol.
Thanks a lot

On 2021/09/07 17:06:10, "David Li"  wrote: 
> Yes, and to be extra clear: Flight currently only supports gRPC, and hence
> HTTP/2 (barring a few hypothetical configurations). It may also be that you
> need to explicitly tell Wireshark the protocol in use.
> 
> -David
> 
> On Tue, Sep 7, 2021, at 13:03, Nate Bauernfeind wrote:
> > HTTP (and HTTP/2) traffic is sent over TCP. You might need to be more
> > specific, or possibly do some more research on your end
> > 
> > Which arrow flight client are you using in your test? Java? C++? Which
> > version? Can you provide a simple gRPC server/client example that shows up
> > in WireShark as you expect it?
> > 
> > Nate
> > 
> > On Tue, Sep 7, 2021 at 10:40 AM Mohamed Abdelhakem <
> > mohamed.abdelha...@incorta.com> wrote:
> > 
> > > When I built a simple FlightServer and FlightClient, I noticed that the
> > > traffic captured by WireShark is TCP, not HTTP/2
> > > My question is how to configure Arrow Flight to use HTTP/2 protocol 
> > > traffic
> > >
> > 
> > 
> > --
> > 
> 


Re: HTTP traffic of Arrow Flight

2021-09-07 Thread Mohamed Abdelhakem
I am using Java Flight Client using Arrow Flight gRPC version 5.0

On 2021/09/07 17:03:42, Nate Bauernfeind  wrote: 
> HTTP (and HTTP/2) traffic is sent over TCP. You might need to be more
> specific, or possibly do some more research on your end
> 
>  Which arrow flight client are you using in your test? Java? C++? Which
> version? Can you provide a simple gRPC server/client example that shows up
> in WireShark as you expect it?
> 
> Nate
> 
> On Tue, Sep 7, 2021 at 10:40 AM Mohamed Abdelhakem <
> mohamed.abdelha...@incorta.com> wrote:
> 
> > When I built a simple FlightServer and FlightClient, I noticed that the
> > traffic captured by WireShark is TCP, not HTTP/2
> > My question is how to configure Arrow Flight to use HTTP/2 protocol traffic
> >
> 
> 
> --
> 


Re: HTTP traffic of Arrow Flight

2021-09-07 Thread David Li
Yes, and to be extra clear: Flight currently only supports gRPC, and hence
HTTP/2 (barring a few hypothetical configurations). It may also be that you
need to explicitly tell Wireshark the protocol in use.

-David

On Tue, Sep 7, 2021, at 13:03, Nate Bauernfeind wrote:
> HTTP (and HTTP/2) traffic is sent over TCP. You might need to be more
> specific, or possibly do some more research on your end
> 
> Which arrow flight client are you using in your test? Java? C++? Which
> version? Can you provide a simple gRPC server/client example that shows up
> in WireShark as you expect it?
> 
> Nate
> 
> On Tue, Sep 7, 2021 at 10:40 AM Mohamed Abdelhakem <
> mohamed.abdelha...@incorta.com> wrote:
> 
> > When I built a simple FlightServer and FlightClient, I noticed that the
> > traffic captured by WireShark is TCP, not HTTP/2
> > My question is how to configure Arrow Flight to use HTTP/2 protocol traffic
> >
> 
> 
> --
> 


Re: HTTP traffic of Arrow Flight

2021-09-07 Thread Nate Bauernfeind
HTTP (and HTTP/2) traffic is sent over TCP. You might need to be more
specific, or possibly do some more research on your end.

Which Arrow Flight client are you using in your test? Java? C++? Which
version? Can you provide a simple gRPC server/client example that shows up
in Wireshark as you expect it?

Nate

On Tue, Sep 7, 2021 at 10:40 AM Mohamed Abdelhakem <
mohamed.abdelha...@incorta.com> wrote:

> When I built a simple FlightServer and FlightClient, I noticed that the
> traffic captured by WireShark is TCP, not HTTP/2
> My question is how to configure Arrow Flight to use HTTP/2 protocol traffic
>


--


HTTP traffic of Arrow Flight

2021-09-07 Thread Mohamed Abdelhakem
When I built a simple FlightServer and FlightClient, I noticed that the traffic
captured by Wireshark is TCP, not HTTP/2.
My question is how to configure Arrow Flight to use HTTP/2 for its traffic.


Fwd: HDFS ORC to Arrow Dataset

2021-09-07 Thread Manoj Kumar
Hi Dev-Community,

Can anyone help guide me on how to read ORC directly from HDFS into an
Arrow dataset?

Thanks
Manoj


Re: [Question] Allocations along 64 byte cache lines

2021-09-07 Thread Jorge Cardoso Leitão
Thanks,

I think that the alignment requirement in IPC is different from this one:
we enforce 8/64 byte alignment when serializing for IPC, but we (only)
recommend 64 byte alignment in memory addresses (at least this is my
understanding from the above link).

I did test adding two arrays and the result is independent of the alignment
(on my machine, compiler, etc).

Yibo, thanks a lot for that example. I am unsure whether it captures the
cache alignment concept, though: in the example we are reading a long (8
bytes) from a pointer that is not aligned with 8 bytes (63 % 8 != 0), which
is both slow and often undefined behavior. I think that the bench we want
is to change 63 to 64-8 (which is still not 64-bytes cache aligned but
aligned with a long), the difference vanishes (under the same gotchas that
you mentioned) https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
Alternatively, add an int32 with an offset of 4.

I benched both with explicit (via intrinsics) SIMD and without (i.e. let
the compiler do it for us), and the alignment does not impact the benches.

Best,
Jorge

[1] https://stackoverflow.com/a/27184001/931303





On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai  wrote:

> Did a quick bench of accessing a long buffer not 8-byte aligned. Given
> enough conditions, it looks like unaligned access does show some penalty
> over aligned access. But I don't think this is an issue in practice.
>
> Please be very skeptical of this benchmark. It's hard to get it right
> given the complexity of hardware, compiler, benchmark tool and env.
>
> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
>
>
> On 9/7/21 7:55 AM, Micah Kornfield wrote:
> >>
> >> My own impression is that the emphasis may be slightly exaggerated. But
> >> perhaps some other benchmarks would prove differently.
> >
> >
> > This is probably true.  [1] is the original mailing list discussion.  I
> > think lack of measurable differences and high overhead for 64 byte
> > alignment was the reason for relaxing to 8 byte alignment.
> >
> > Specifically, I performed two types of tests, a "random sum" where we
> >> compute the sum of the values taken at random indices, and "sum", where we
> >> sum all values of the array (buffer[1] of the primitive array), both for
> >> arrays ranging from 2^10 to 2^25 elements. I was expecting that, at least in
> >> the latter, prefetching would help, but I do not observe any difference.
> >
> >
> > The most likely place I think where this could make a difference would be
> > for operations on wider types (Decimal128 and Decimal256). Another place
> > where I think alignment could help is when adding two primitive arrays (it
> > sounds like this was summing a single array?).
> >
> > [1]
> >
> https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
> >
> > On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou 
> wrote:
> >
> >>
> >> Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :
> >>> Thanks a lot Antoine for the pointers. Much appreciated!
> >>>
> >>> Generally, it should not hurt to align allocations to 64 bytes anyway,
>  since you are generally dealing with large enough data that the
>  (small) memory overhead doesn't matter.
> >>>
> >>> Not for performance. However, 64 byte alignment in Rust requires
> >>> maintaining a custom container, a custom allocator, and the inability to
> >>> interoperate with `std::Vec` and the ecosystem that is based on it, since
> >>> std::Vec allocates with alignment T (e.g. int32), not 64 bytes. For anyone
> >>> interested, the background for this is this old PR [1] and in arrow2 [2].
> >>
> >> I see. In the C++ implementation, we are not compatible with the default
> >> allocator either (but C++ allocators as defined by the standard library
> >> don't support resizing, which doesn't make them terribly useful for
> >> Arrow anyway).
> >>
> >>> Neither myself in micro benches nor Ritchie from polars (query engine) in
> >>> large scale benches observe any difference in the archs we have available.
> >>> This is not consistent with the emphasis we put on the memory alignments
> >>> discussion [3], and I am trying to understand the root cause for this
> >>> inconsistency.
> >>
> >> My own impression is that the emphasis may be slightly exaggerated. But
> >> perhaps some other benchmarks would prove differently.
> >>
> >>> By prefetching I mean implicit; no intrinsics involved.
> >>
> >> Well, I'm not aware that implicit prefetching depends on alignment.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
>