RE: [VOTE] Release Apache Arrow 15.0.0 - RC1

2024-01-17 Thread Yibo Cai
+1 (binding)

Verified cpp/python/go on ubuntu20.04, aarch64

TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 TEST_GO=1 \
dev/release/verify-release-candidate.sh 15.0.0 1

-Original Message-
From: Raúl Cumplido 
Sent: Wednesday, January 17, 2024 18:58
To: dev@arrow.apache.org
Subject: [VOTE] Release Apache Arrow 15.0.0 - RC1

Hi,

I would like to propose the following release candidate (RC1) of Apache Arrow 
version 15.0.0. This is a release consisting of 330 resolved GitHub issues[1].

This release candidate is based on commit:
a61f4af724cd06c3a9b4abd20491345997e532c0 [2]

The source release rc1 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests, and vote 
on the release. See [13] for how to validate a release candidate.

See also a verification result on GitHub pull request [14].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 15.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 15.0.0 because...

[1]: 
https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A15.0.0+is%3Aclosed
[2]: 
https://github.com/apache/arrow/tree/a61f4af724cd06c3a9b4abd20491345997e532c0
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-15.0.0-rc1
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/15.0.0-rc1
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/15.0.0-rc1
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/15.0.0-rc1
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/a61f4af724cd06c3a9b4abd20491345997e532c0/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[14]: https://github.com/apache/arrow/pull/39641


Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

2023-11-27 Thread Yibo Cai

Congrats Andy!

On 11/28/23 06:03, L. C. Hsieh wrote:

Congrats Andy!

Thanks Andrew for the efforts to lead the Arrow project in the past year!

On Tue, Nov 28, 2023 at 3:51 AM Krisztián Szűcs wrote:


Congrats Andy & Thanks Andrew!

On Mon, Nov 27, 2023 at 6:55 PM Chao Sun  wrote:


Congratulations Andy! And thanks Andrew for the awesome work in the past year!

Chao

On Mon, Nov 27, 2023 at 9:51 AM Jeremy Dyer  wrote:


Thanks for your leadership this past year Andrew and I know we are in good 
hands with Andy going forward. Congrats Andy!

- Jeremy Dyer

Get Outlook for iOS

From: Nic Crane 
Sent: Monday, November 27, 2023 11:18:33 AM
To: dev@arrow.apache.org 
Subject: Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

Congrats Andy!

On Mon, 27 Nov 2023 at 15:17, Gang Wu  wrote:


Congrats Andy!

Thanks Andrew for the past year as well.

Best,
Gang

On Mon, Nov 27, 2023 at 10:59 PM Matt Topol wrote:


Congrats Andy!

On Mon, Nov 27, 2023 at 9:44 AM Gavin Ray  wrote:


Yay, congrats Andy! Well-deserved!

On Mon, Nov 27, 2023 at 9:13 AM Kevin Gurney wrote:


Congratulations, Andy!

From: Raúl Cumplido 
Sent: Monday, November 27, 2023 8:58 AM
To: dev@arrow.apache.org 
Subject: Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

Congratulations Andy and thanks for the effort during last year

Andrew!


On Mon, 27 Nov 2023 at 14:54, David Li () wrote:


Congrats Andy!

On Mon, Nov 27, 2023, at 08:02, Mehmet Ozan Kabak wrote:

Congratulations Andy. I am sure we will keep building great tech

this

year, just like last year, under your watch.

Mehmet Ozan Kabak



On Nov 27, 2023, at 3:47 PM, Daniël Heres  wrote:

Congrats Andy!

On Mon, 27 Nov 2023 at 13:47, Andrew Lamb  wrote:



I am pleased to announce that the Arrow Project has a new PMC chair and VP
as per our tradition of rotating the chair once a year. I have resigned and
Andy Grove was duly elected by the PMC and approved unanimously by the board.

Please join me in congratulating Andy Grove!

Thanks,
Andrew




--
Daniël Heres

RE: [VOTE] Release Apache Arrow 14.0.0 - RC2

2023-10-25 Thread Yibo Cai
+1

Verified cpp/python/go on Arm64 ubuntu-20.04. No blocking issue found.

TEST_DEFAULT=0 \
TEST_CPP=1 \
TEST_PYTHON=1 \
TEST_GO=1 \
dev/release/verify-release-candidate.sh 14.0.0 2


-Original Message-
From: Sutou Kouhei 
Sent: Wednesday, October 25, 2023 14:03
To: dev@arrow.apache.org
Subject: Re: [VOTE] Release Apache Arrow 14.0.0 - RC2

+1

I ran the following on Debian GNU/Linux sid:

  * TEST_DEFAULT=0 \
  TEST_SOURCE=1 \
  LANG=C \
  TZ=UTC \
  CUDAToolkit_ROOT=/usr \
  ARROW_CMAKE_OPTIONS="-DBoost_NO_BOOST_CMAKE=ON -Dxsimd_SOURCE=BUNDLED" \
  dev/release/verify-release-candidate.sh 14.0.0 2

  * TEST_DEFAULT=0 \
  TEST_APT=1 \
  LANG=C \
  dev/release/verify-release-candidate.sh 14.0.0 2

  * TEST_DEFAULT=0 \
  TEST_BINARY=1 \
  LANG=C \
  dev/release/verify-release-candidate.sh 14.0.0 2

  * TEST_DEFAULT=0 \
  TEST_JARS=1 \
  LANG=C \
  dev/release/verify-release-candidate.sh 14.0.0 2

  * TEST_DEFAULT=0 \
  TEST_PYTHON_VERSIONS=3.11 \
  TEST_WHEELS=1 \
  LANG=C \
  dev/release/verify-release-candidate.sh 14.0.0 2

  * TEST_DEFAULT=0 \
  TEST_YUM=1 \
  LANG=C \
  dev/release/verify-release-candidate.sh 14.0.0 2

with:

  * .NET SDK (7.0.401)
  * Python 3.11.5
  * gcc (Debian 13.2.0-4) 13.2.0
  * nvidia-cuda-dev 12.0.146~12.0.1-2
  * openjdk version "17.0.9-ea" 2023-10-17
  * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]


Thanks,
--
kou


In 
  "[VOTE] Release Apache Arrow 14.0.0 - RC2" on Tue, 24 Oct 2023 09:17:59 +0200,
  Raúl Cumplido  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC2) of Apache
> Arrow version 14.0.0. This is a release consisting of 461
> resolved GitHub issues[1].
>
> This release candidate is based on commit:
> 2dcee3f82c6cf54b53a64729fd81840efa583244 [2]
>
> The source release rc2 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> The changelog is located at [12].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [13] for how to validate a release candidate.
>
> See also a verification result on GitHub pull request [14].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 14.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 14.0.0 because...
>
> [1]: 
> https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A14.0.0+is%3Aclosed
> [2]: 
> https://github.com/apache/arrow/tree/2dcee3f82c6cf54b53a64729fd81840efa583244
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-14.0.0-rc2
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/14.0.0-rc2
> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/14.0.0-rc2
> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/14.0.0-rc2
> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [12]: 
> https://github.com/apache/arrow/blob/2dcee3f82c6cf54b53a64729fd81840efa583244/CHANGELOG.md
> [13]: 
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> [14]: https://github.com/apache/arrow/pull/38343


RE: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread Yibo Cai
Congrats Xuwei!

-Original Message-
From: Gang Wu 
Sent: Monday, October 23, 2023 13:29
To: dev@arrow.apache.org
Subject: Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

Congrats Xuwei!

Best,
Gang

On Mon, Oct 23, 2023 at 12:56 PM Sutou Kouhei  wrote:

> On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu has
> accepted an invitation to become a committer on Apache Arrow. Welcome,
> and thank you for your contributions!
>
> --
> kou
>


Re: [VOTE] Release Apache Arrow 13.0.0 - RC0

2023-07-24 Thread Yibo Cai

+1.

Verified c++/python/go source on Ubuntu-22.04 aarch64.

TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 TEST_GO=1 \
dev/release/verify-release-candidate.sh 13.0.0 0

Encountered a non-blocking issue:
https://github.com/apache/arrow/issues/36860

On 7/21/23 17:49, Raúl Cumplido wrote:

Hi,

As discussed during the community calls I have also triggered the
benchmark tests on the Pull Request for RC 0 [1].

I am trying to get the conbench comparison between the 13.0.0 RC0 and
12.0.1 RC1 (latest release) by having a chat with the conbench
maintainers. I'll share as soon as I have it.

I wanted to share the Verification email as soon as possible so we can
start running the verification process.

Thanks,
Raúl

[1] https://github.com/apache/arrow/pull/36775#issuecomment-1645088676

On Fri, 21 Jul 2023 at 11:45, Raúl Cumplido () wrote:


Hi,

I would like to propose the following release candidate (RC0) of Apache
Arrow version 13.0.0. This is a release consisting of 428
resolved GitHub issues[1].

This release candidate is based on commit:
ac2d207611ce25c91fb9fc90d5eaff2933609660 [2]

The source release rc0 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

See also a verification result on GitHub pull request [14].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 13.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 13.0.0 because...

[1]: 
https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
[2]: 
https://github.com/apache/arrow/tree/ac2d207611ce25c91fb9fc90d5eaff2933609660
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc0
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc0
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc0
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc0
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[14]: https://github.com/apache/arrow/pull/36775


Re: [ANNOUNCE] New Arrow PMC member: Matt Topol

2023-05-03 Thread Yibo Cai

Congrats Matt!

On 5/4/23 07:07, Krisztián Szűcs wrote:

Congrats Matt!

On Wed, May 3, 2023 at 11:44 PM Rok Mihevc  wrote:


Congrats Matt. Well deserved!

Rok

On Wed, May 3, 2023 at 11:03 PM David Li  wrote:


Congrats Matt!

On Wed, May 3, 2023, at 16:06, Neal Richardson wrote:

Congratulations!

On Wed, May 3, 2023 at 1:58 PM Jacob Wujciak wrote:


Congratulations, well deserved!

On Wed, May 3, 2023 at 7:48 PM Weston Pace wrote:



Congratulations!

On Wed, May 3, 2023 at 10:47 AM Raúl Cumplido wrote:


Congratulations Matt!

On Wed, 3 May 2023 at 19:44, vin jake wrote:



Congratulations, Matt!

Felipe Oliveira Carvalho wrote on Thu, May 4, 2023 at 01:42:



Congratulations, Matt!

On Wed, 3 May 2023 at 14:37, Andrew Lamb wrote:



The Project Management Committee (PMC) for Apache Arrow has invited
Matt Topol (zeroshade) to become a PMC member and we are pleased to
announce that Matt has accepted.

Congratulations and welcome!

Re: [VOTE] Release Apache Arrow 12.0.0 - RC0

2023-04-23 Thread Yibo Cai

+1

I ran the following on Ubuntu-22.04, aarch64.

TEST_DEFAULT=0 \
  TEST_CPP=1 \
  TEST_PYTHON=1 \
  TEST_GO=1 \
  dev/release/verify-release-candidate.sh 12.0.0 0

TEST_DEFAULT=0 \
  TEST_WHEELS=1 \
  dev/release/verify-release-candidate.sh 12.0.0 0


On 4/23/23 14:40, Sutou Kouhei wrote:

+1

I ran the following on Debian GNU/Linux sid:

   * TEST_DEFAULT=0 \
   TEST_SOURCE=1 \
   LANG=C \
   TZ=UTC \
   CUDAToolkit_ROOT=/usr \
   ARROW_CMAKE_OPTIONS="-DBoost_NO_BOOST_CMAKE=ON -Dxsimd_SOURCE=BUNDLED" \
   dev/release/verify-release-candidate.sh 12.0.0 0

   * TEST_DEFAULT=0 \
   TEST_APT=1 \
   LANG=C \
   dev/release/verify-release-candidate.sh 12.0.0 0

   * TEST_DEFAULT=0 \
   TEST_BINARY=1 \
   LANG=C \
   dev/release/verify-release-candidate.sh 12.0.0 0

   * TEST_DEFAULT=0 \
   TEST_JARS=1 \
   LANG=C \
   dev/release/verify-release-candidate.sh 12.0.0 0

   * TEST_DEFAULT=0 \
   TEST_PYTHON_VERSIONS=3.11 \
   TEST_WHEELS=1 \
   LANG=C \
   dev/release/verify-release-candidate.sh 12.0.0 0

   * TEST_DEFAULT=0 \
   TEST_YUM=1 \
   LANG=C \
   dev/release/verify-release-candidate.sh 12.0.0 0

with:

   * .NET SDK (6.0.406)
   * Python 3.11.2
   * gcc (Debian 12.2.0-14) 12.2.0
   * nvidia-cuda-dev 11.7.99~11.7.1-4
   * openjdk version "17.0.6" 2023-01-17
   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]


Thanks,


Re: [DISCUSS] The default commit message for merge button

2023-01-31 Thread Yibo Cai

+1 for title and description

For a purely personal reason: I care about commit messages and often try
to write informative ones (even following pedantic rules like the
72-character line length). It's a bit disappointing if those messages are
not shown by `git log`.



On 2/1/23 10:46, Jacob Wujciak wrote:

Yes, same here. I personally don't use the git log much, but I am fine with
title + description if it is beneficial for others.

Joris Van den Bossche  wrote on Tue, 31 Jan 2023, 22:28:


And to be explicit, as mentioned in my original post, I am fine with
both, so +1 on title + description

On Tue, 31 Jan 2023 at 21:53, Matthew Topol wrote:


(non-binding)  I am also +1 for title + description

On Tue, Jan 31, 2023 at 11:17 AM Neal Richardson <
neal.p.richard...@gmail.com> wrote:


+1, with or without description

Neal

On Tue, Jan 31, 2023 at 11:04 AM Weston Pace wrote:


+1 to both.  This only applies to the merge button and, in that case, the
committer has a chance to review the message before merging.  So if there
is garbage in the description hopefully they can catch this here and adjust
it or ask for a cleanup of the description.

Also, it seems we're trying to be more "conventional"[1] with things like
BREAKING CHANGE in the description.  If we want to go that route the
description is necessary.

[1] https://www.conventionalcommits.org/en/v1.0.0/
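
For illustration, a conventional commit message with a breaking-change footer
would look roughly like the following (a hypothetical example, not taken from
the thread):

feat(compute)!: change the default rounding mode

Round half-to-even by default instead of half-up.

BREAKING CHANGE: round() results differ for .5 inputs.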

On Tue, Jan 31, 2023, 5:49 AM David Li  wrote:


I am also +1 for title + description.

On Tue, Jan 31, 2023, at 05:16, Felipe Oliveira Carvalho wrote:

+1 for "pull request title *and* description".

Being able to read descriptions without leaving the editor is handy.

Keeping that information tracked in the repo means we don't depend on
GitHub to reconstruct the history of the project.

On Tue, 31 Jan 2023 at 06:43, Antoine Pitrou wrote:


+1 for "pull request title *and* description".

I'd rather have the description recorded in git than have to look up a
PR to get more explanations. Also, we don't know what Github will have
become in 10 years.



On 31/01/2023 at 09:53, Joris Van den Bossche wrote:

I would personally prefer to use just "Pull request title" instead of
"Pull request title and description".

In my experience, including the description in the commit message (as
we already do) more often gives noise to the output of `git log`, and
you can always go from the commit to the PR to see the full context.
In many cases, the description is quite verbose or contains long
examples, or might be outdated (written when the PR was opened, but
the PR might have changed along the review process), ... Especially
now that we have the github PR template with sections, they might
become even more verbose.

Personally, when opening a PR myself, I often leave the top post
empty, to add a second comment with more explanation, exactly to avoid
including that in the commit message if that doesn't seem useful to me.

Anyway, I am certainly OK with both options if the general consensus
is for "Pull request title and description" (and certainly if that
enables actually using the merge button), but just stating my personal
preference.

On Tue, 31 Jan 2023 at 09:33, Raúl Cumplido  wrote:


+1
We already do it on the merge script and we have already changed it on the
`arrow-site` repo.

On Tue, 31 Jan 2023 at 9:13, Sutou Kouhei () wrote:


Hi,

We need to get consensus to change the default commit
message for merge button:

https://issues.apache.org/jira/browse/INFRA-24133


Could you change the default commit message when merging a
PR to "Default to pull request title and description" on
the following Apache Arrow related repositories?

* https://github.com/apache/arrow
* https://github.com/apache/arrow-adbc
* https://github.com/apache/arrow-cookbook
* https://github.com/apache/arrow-flight-sql-postgresql
* https://github.com/apache/arrow-julia
* https://github.com/apache/arrow-nanoarrow
* https://github.com/apache/arrow-testing

See also:

https://github.blog/changelog/2022-08-23-new-options-for-controlling-the-default-commit-message-when-merging-a-pull-request/


Related: https://issues.apache.org/jira/browse/INFRA-24122

https://issues.apache.org/jira/browse/INFRA-24133#comment-17682383



Please provide a mailing list link to project consensus on this change.



How about changing the default commit message for the merge
button to "Default to pull request title and description",
like our dev/merge_arrow_pr.py in apache/arrow does?

Note that this doesn't mean that we drop
dev/merge_arrow_pr.py immediately. That's a separate
discussion.


Thanks,
--
kou

RE: [ANNOUNCE] New Arrow PMC chair: Andrew Lamb

2022-12-26 Thread Yibo Cai
Congratulations!

-Original Message-
From: Rok Mihevc 
Sent: Tuesday, December 27, 2022 7:57 AM
To: dev@arrow.apache.org
Subject: Re: [ANNOUNCE] New Arrow PMC chair: Andrew Lamb

Congratulations Andrew!

Rok

On Mon, Dec 26, 2022 at 11:26 PM Neal Richardson < neal.p.richard...@gmail.com> 
wrote:

> Congratulations!
>
> On Mon, Dec 26, 2022 at 4:38 PM Matt Topol  wrote:
>
> > Congrats!!!
> >
> > > On Mon, Dec 26, 2022, 12:47 PM Jacob Wujciak wrote:
> >
> > > Congratulations Andrew!
> > >
> > > > Matthew Turner  wrote on Mon, 26 Dec 2022, 16:44:
> > >
> > > > Congratulations, Andrew!
> > > >
> > > > From: Yijie Shen 
> > > > Date: Monday, December 26, 2022 at 8:14 AM
> > > > To: dev@arrow.apache.org 
> > > > Subject: Re: [ANNOUNCE] New Arrow PMC chair: Andrew Lamb
> > > > Congrats Andrew!
> > > >
> > > > > On Mon, Dec 26, 2022 at 20:37, Wang Xudong wrote:
> > > >
> > > > > Congratulations Andrew!
> > > > >
> > > > > Thank you for your dedication to arrow rust ecosystem!
> > > > >
> > > > > > Willy Kuo wrote on Mon, Dec 26, 2022 at 20:13:
> > > > >
> > > > > > Congratulations Andrew!
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > > > On Dec 26, 2022, at 7:48 PM, Nic Crane wrote:
> > > > > > >
> > > > > > > Congratulations!
> > > > > > >
> > > > > > >> On Mon, 26 Dec 2022, 11:01, Daniël Heres  wrote:
> > > > > > >>
> > > > > > >> Congrats Andrew!
> > > > > > >>
> > > > > > >>> On Mon, Dec 26, 2022, 09:00, L. C. Hsieh wrote:
> > > > > > >>>
> > > > > > >>> Congratulations!
> > > > > > >>>
> > > > > > >>> On Sun, Dec 25, 2022 at 10:36 PM Weston Pace  wrote:
> > > > > > 
> > > > > >  Congratulations!
> > > > > > 
> > > > > >  On Sun, Dec 25, 2022, 9:44 PM Remzi Yang <1371656737...@gmail.com> wrote:
> > > > > > 
> > > > > > > Congratulations Andrew!
> > > > > > >
> > > > > > > On Mon, 26 Dec 2022 at 13:40, David Li  wrote:
> > > > > > >
> > > > > > >> Congrats Andrew!
> > > > > > >>
> > > > > > >> On Mon, Dec 26, 2022, at 00:26, vin jake wrote:
> > > > > > >>> congratulation!
> > > > > > >>>
> > > > > > >>> Sutou Kouhei  wrote on Mon, Dec 26, 2022 at 12:54:
> > > > > > >>>
> > > > > >  I am pleased to announce that we have a new PMC chair and VP as
> > > > > >  per our newly started tradition of rotating the chair once a
> > > > > >  year. I have resigned and Andrew Lamb was duly elected by the
> > > > > >  PMC and approved unanimously by the board. Please join me in
> > > > > >  congratulating Andrew Lamb!
> > > > > >
> > > > > >  Thanks,
> > > > > >  --
> > > > > >  kou


Re: [VOTE] Release Apache Arrow 10.0.0 - RC0

2022-10-23 Thread Yibo Cai

+1

Verified C++/Python/Go source on ubuntu20.04, aarch64.

TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 TEST_GO=1 \
dev/release/verify-release-candidate.sh 10.0.0 0


On 10/21/22 14:06, Sutou Kouhei wrote:

Hi,

I would like to propose the following release candidate (RC0) of Apache
Arrow version 10.0.0. This is a release consisting of 470
resolved JIRA issues[1].

This release candidate is based on commit:
89f9a0948961f6e94f1ef5e4f310b707d22a3c11 [2]

The source release rc0 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

See also a verification result on GitHub pull request [14].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 10.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 10.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%2010.0.0
[2]: 
https://github.com/apache/arrow/tree/89f9a0948961f6e94f1ef5e4f310b707d22a3c11
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-10.0.0-rc0
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/10.0.0-rc0
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/10.0.0-rc0
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/10.0.0-rc0
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/89f9a0948961f6e94f1ef5e4f310b707d22a3c11/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[14]: https://github.com/apache/arrow/pull/14466


nightly job failures

2022-09-26 Thread Yibo Cai

There are some nightly job failures [1].
Pasted some logs below, not sure if already reported.

osx related
---
https://github.com/ursacomputing/crossbow/actions/runs/3125303015/jobs/5069525838#step:7:747
https://github.com/ursacomputing/crossbow/actions/runs/3125257793/jobs/5069437237#step:7:2844

thrift link broken
--
https://github.com/ursacomputing/crossbow/actions/runs/3125253061/jobs/5069428221#step:8:1758

thread sanitizer
----------------
https://github.com/ursacomputing/crossbow/actions/runs/3125311912/jobs/5069545124#step:5:5153

asof-join-node-test
---
https://github.com/ursacomputing/crossbow/actions/runs/3125283610/jobs/5069484524#step:4:2754

[1] https://github.com/apache/arrow/pull/14201#issuecomment-1257485286


Re: [ANNOUNCE] New Arrow PMC member: Weston Pace

2022-09-05 Thread Yibo Cai
Congratulations Weston!

From: Andrew Lamb 
Sent: Monday, September 5, 2022 9:18 PM
To: dev 
Subject: Re: [ANNOUNCE] New Arrow PMC member: Weston Pace

Congratulations Weston!

On Mon, Sep 5, 2022 at 8:09 AM Rok Mihevc  wrote:

> Congrats Weston!
>
> Rok
>


Re: [VOTE] C++: switch to C++17

2022-08-24 Thread Yibo Cai

+1 (binding)

On 8/25/22 04:03, Mauricio Vargas Sepúlveda wrote:

+1

On 2022-08-24 12:50, Weston Pace wrote:

+1 (non-binding)

On Wed, Aug 24, 2022 at 9:24 AM Keith Kraus wrote:

+1 (non-binding)

On Wed, Aug 24, 2022 at 12:12 PM David Li  wrote:


+1 (binding)

On Wed, Aug 24, 2022, at 12:06, Ivan Ogasawara wrote:

+1 (non-binding)

On Wed, Aug 24, 2022 at 12:00 PM Sasha Krassovsky  wrote:


++1 (non-binding)


On Aug 24, 2022, at 08:53, Jacob Wujciak wrote:

+ 1 (non-binding)

Benjamin Kietzman  wrote on Wed, 24 Aug 2022, 17:43:


+1 (binding)

On Wed, Aug 24, 2022, 11:32 Antoine Pitrou wrote:

Hello,

I would like to propose that the Arrow C++ implementation switch to
C++17 as its baseline supported version (currently C++11).

The rationale and subsequent discussion can be read in the archives here:
https://lists.apache.org/thread/9g14n3odhj6kzsgjxr6k6d3q73hg2njr

The exact steps and timeline for switching can be decided later on, but
this proposal implies that it could happen soon, possibly next week :-)
... or, more realistically, in the next Arrow C++ release, 10.0.0.

The vote will be open for at least 72 hours.

[ ] +1 Switch to C++17 in the impending future
[ ] +0
[ ] -1 Do not switch to C++17 because...

Regards
Antoine.


Re: [VOTE] Release Apache Arrow 9.0.0 - RC2

2022-08-02 Thread Yibo Cai

+1 (binding)

Verified source (cpp/python/go) and wheels on ubuntu-20.04 aarch64.

TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 TEST_GO=1 \
dev/release/verify-release-candidate.sh 9.0.0 2


TEST_DEFAULT=0 TEST_WHEELS=1 dev/release/verify-release-candidate.sh 9.0.0 2


On 7/30/22 07:10, Krisztián Szűcs wrote:

Hi,

I would like to propose the following release candidate (RC2) of Apache
Arrow version 9.0.0. This is a release consisting of 507
resolved JIRA issues[1].

This release candidate is based on commit:
ea6875fd2a3ac66547a9a33c5506da94f3ff07f2 [2]

The source release rc2 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 9.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 9.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%209.0.0
[2]: 
https://github.com/apache/arrow/tree/ea6875fd2a3ac66547a9a33c5506da94f3ff07f2
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-9.0.0-rc2
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/9.0.0-rc2
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/9.0.0-rc2
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/9.0.0-rc2
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/ea6875fd2a3ac66547a9a33c5506da94f3ff07f2/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


RE: [VOTE] Release Apache Arrow 6.0.2 - RC0

2022-07-14 Thread Yibo Cai
+1, verified on arm64

-Original Message-
From: Sutou Kouhei 
Sent: Friday, July 15, 2022 5:11 AM
To: dev@arrow.apache.org
Subject: [VOTE] Release Apache Arrow 6.0.2 - RC0

Hi,

I would like to propose the following release candidate
(RC0) of Apache Arrow version 6.0.2. This is a release consisting of 1 resolved
JIRA issue[1].

This is one of the releases[2] that focus on a Go-related security
vulnerability[3]. We don't publish binary artifacts for this release because we
don't have Go-related binaries.

This release candidate is based on commit:
3ea5af64865f9910d3c98162c7949af8d63ec68e [4]

The source release rc0 is hosted at [5].
The changelog is located at [6].

Please download, verify checksums and signatures, run the unit tests, and vote
on the release. See [7] for how to validate a release candidate. But you only
need to verify the Go related tests because this release candidate only includes
a change for Go, so you can use the following command line:

TEST_DEFAULT=0 TEST_GO=1 dev/release/verify-release-candidate.sh 6.0.2 0

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 6.0.2
[ ] +0
[ ] -1 Do not release this as Apache Arrow 6.0.2 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%206.0.2
[2] https://lists.apache.org/thread/qkkzpvmxc0coqhdkc1qoygwy6h4v5sgn
[3] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-28948
[4]: 
https://github.com/apache/arrow/tree/3ea5af64865f9910d3c98162c7949af8d63ec68e
[5]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-6.0.2-rc0
[6]: 
https://github.com/apache/arrow/blob/3ea5af64865f9910d3c98162c7949af8d63ec68e/CHANGELOG.md
[7]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


RE: [VOTE] Release Apache Arrow 7.0.1 - RC0

2022-07-14 Thread Yibo Cai
+1, verified on arm64

-Original Message-
From: Sutou Kouhei 
Sent: Friday, July 15, 2022 5:11 AM
To: dev@arrow.apache.org
Subject: [VOTE] Release Apache Arrow 7.0.1 - RC0

Hi,

I would like to propose the following release candidate
(RC0) of Apache Arrow version 7.0.1. This is a release consisting of 1 resolved
JIRA issue[1].

This is one of the releases[2] that focus on a Go-related security
vulnerability[3]. We don't publish binary artifacts for this release because we
don't have Go-related binaries.

This release candidate is based on commit:
072ae55dc8172bb1a898fda5d5a83ec063b05a6d [4]

The source release rc0 is hosted at [5].
The changelog is located at [6].

Please download, verify checksums and signatures, run the unit tests, and vote
on the release. See [7] for how to validate a release candidate. But you only
need to verify the Go related tests because this release candidate only includes
a change for Go, so you can use the following command line:

TEST_DEFAULT=0 TEST_GO=1 dev/release/verify-release-candidate.sh 7.0.1 0

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 7.0.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow 7.0.1 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%207.0.1
[2] https://lists.apache.org/thread/qkkzpvmxc0coqhdkc1qoygwy6h4v5sgn
[3] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-28948
[4]: 
https://github.com/apache/arrow/tree/072ae55dc8172bb1a898fda5d5a83ec063b05a6d
[5]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-7.0.1-rc0
[6]: 
https://github.com/apache/arrow/blob/072ae55dc8172bb1a898fda5d5a83ec063b05a6d/CHANGELOG.md
[7]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


RE: [VOTE] Release Apache Arrow 8.0.1 - RC0

2022-07-14 Thread Yibo Cai
+1, verified on arm64

-Original Message-
From: Sutou Kouhei 
Sent: Friday, July 15, 2022 5:11 AM
To: dev@arrow.apache.org
Subject: [VOTE] Release Apache Arrow 8.0.1 - RC0

Hi,

I would like to propose the following release candidate
(RC0) of Apache Arrow version 8.0.1. This is a release consisting of 1 resolved
JIRA issue[1].

This is one of the releases[2] that focus on a Go-related security
vulnerability[3]. We don't publish binary artifacts for this release because we
don't have Go-related binaries.

This release candidate is based on commit:
9966c39583f1e203bac9200753e9db32478d43a6 [4]

The source release rc0 is hosted at [5].
The changelog is located at [6].

Please download, verify checksums and signatures, run the unit tests, and vote
on the release. See [7] for how to validate a release candidate. But you only
need to verify the Go related tests because this release candidate only includes
a change for Go, so you can use the following command line:

TEST_DEFAULT=0 TEST_GO=1 dev/release/verify-release-candidate.sh 8.0.1 0

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 8.0.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow 8.0.1 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%208.0.1
[2] https://lists.apache.org/thread/qkkzpvmxc0coqhdkc1qoygwy6h4v5sgn
[3] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-28948
[4]: 
https://github.com/apache/arrow/tree/9966c39583f1e203bac9200753e9db32478d43a6
[5]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-8.0.1-rc0
[6]: 
https://github.com/apache/arrow/blob/9966c39583f1e203bac9200753e9db32478d43a6/CHANGELOG.md
[7]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


Re: [VOTE] Accept donation of Flight SQL JDBC driver

2022-06-30 Thread Yibo Cai

+1


On 7/1/22 04:20, David Li wrote:

Hello,

This vote is to determine if the Arrow PMC is in favor of accepting the 
donation of the Flight SQL JDBC driver.

This process was deemed necessary since there was significant development prior 
to opening the pull request. This was discussed in a previous ML thread [1].

The outline of the IP clearance form is at [2]. There will be further work 
needed to integrate the code into the main development branch, so we will merge 
the PR [3] into a separate branch for the time being.

The vote will be open for at least 72 hours.

[ ] +1 : Accept the donation
[ ] 0 : No opinion
[ ] -1 : Reject donation because...

My vote: +1

[1]: https://lists.apache.org/thread/q5mn906m5788zxppjpcqvltxppc4sol0
[2]: 
https://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-flight-sql-jdbc-driver.xml
[3]: https://github.com/apache/arrow/pull/12830

Best,
David


RE: Merge a pull request with GitHub API

2022-05-18 Thread Yibo Cai
+1

-Original Message-
From: Sutou Kouhei 
Sent: Wednesday, May 18, 2022 11:43 AM
To: dev@arrow.apache.org
Subject: Merge a pull request with GitHub API

Hi,

How about using the GitHub API instead of a local "git merge" to merge a pull request?


We use local "git merge" to merge a pull request in dev/merge_arrow_pr.py.

If we use "git merge" to merge a pull request, GitHub's Web UI shows "Closed" 
mark not "Merged" mark in a pull request page. This sometimes confuses new 
contributors. "Why was my pull request closed without merging?" See
https://github.com/apache/arrow/pull/12004#issuecomment-1031619771
for example.

If we use GitHub API
https://docs.github.com/en/rest/pulls/pulls#merge-a-pull-request
to merge a pull request, GitHub's Web UI shows "Merged" mark not "Closed" mark. 
See
https://github.com/apache/arrow/pull/13180 for example. I used GitHub API to 
merge the pull request.

And we don't need to create a local branch on local repository to merge a pull 
request. But we must specify ARROW_GITHUB_API_TOKEN to run 
dev/merge_arrow_pr.py.
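
For reference, a minimal sketch of the underlying API call with curl (the PR
number and merge method are example values only; the endpoint is the one
documented at the docs.github.com link above):

# Merge pull request #13180 via the GitHub REST API (example values).
curl -X PUT \
  -H "Authorization: token $ARROW_GITHUB_API_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/repos/apache/arrow/pulls/13180/merge \
  -d '{"merge_method": "squash"}'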


See also:

* https://issues.apache.org/jira/browse/ARROW-16602
* https://github.com/apache/arrow/pull/13184


Thanks,
--
kou


RE: [VOTE] Release Apache Arrow 8.0.0 - RC3

2022-05-04 Thread Yibo Cai
+1.

Verified cpp/python/go source and apt binaries on ubuntu20.04, aarch64.

TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 TEST_GO=1 \
dev/release/verify-release-candidate.sh 8.0.0 3
TEST_DEFAULT=0 TEST_APT=1 dev/release/verify-release-candidate.sh 8.0.0 3

-Original Message-
From: Krisztián Szűcs 
Sent: Wednesday, May 4, 2022 4:08 AM
To: dev 
Subject: [VOTE] Release Apache Arrow 8.0.0 - RC3

Hi,

I would like to propose the following release candidate (RC3) of Apache Arrow 
version 8.0.0. This is a release consisting of 608 resolved JIRA issues[1].

This release candidate is based on commit:
c3d031250a7fdcfee5e576833bf6f39097602c30 [2]

The source release rc3 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests, and vote 
on the release. See [13] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 8.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 8.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%208.0.0
[2]: 
https://github.com/apache/arrow/tree/c3d031250a7fdcfee5e576833bf6f39097602c30
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-8.0.0-rc3
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/8.0.0-rc3
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/8.0.0-rc3
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/8.0.0-rc3
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/c3d031250a7fdcfee5e576833bf6f39097602c30/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


RE: [C++] Replacing xsimd with compiler autovectorization

2022-03-29 Thread Yibo Cai
Hi Sasha,

Thanks for the advice. I didn't quite catch the point. Would you explain a bit
about the purpose of this proposal?

We do prefer compiler auto-vectorization to explicit simd code, even if the C++
code is slower than the simd one (20% is acceptable IMO). And we do support
runtime-dispatched kernels based on the target machine arch.

Then what is left to discuss is how to deal with code that is not
auto-vectorizable but can be manually optimized with simd instructions. It
looks like your proposal is to do nothing more than add the appropriate
compiler flags and wait for compilers to become smarter in the future. I think
this is a reasonable approach, probably in many cases. But if we do want to
manually tune the code, I believe a simd library is the best way.

To me there's no "replacing" between xsimd and auto-vectorization; they just do
their own jobs.

Yibo

-Original Message-
From: Sasha Krassovsky 
Sent: Wednesday, March 30, 2022 6:58 AM
To: dev@arrow.apache.org; emkornfi...@gmail.com
Subject: Re: [C++] Replacing xsimd with compiler autovectorization

xsimd has three problems I can think of right now:
1) xsimd code looks like normal simd code: you have to explicitly do loads and 
stores, you have to explicitly unroll and stride through your loop, and you 
have to explicitly process the tail of the loop. This makes writing a large 
number of kernels extremely tedious and error-prone. In comparison, writing a 
single three-line scalar for loop is easier to both read and write.
2) xsimd limits the freedom an optimizer has to select instructions and do 
other optimizations, as it's just a thin wrapper over normal intrinsics.
One concrete example: if we wanted to take advantage of the dynamic dispatch
instruction set xsimd offers, the loop strides would no longer be compile-time
constants, which might prevent the compiler from performing loop unrolling
(how would it know that the stride isn't just 1?).
3) Lastly, if we ever want to support a new architecture (like Power9 or 
RISC-V), we'd have to wait for an xsimd backend to become available. On the 
other hand, if SiFive came out with a hot new chip supporting RV64V, all we'd 
have to do to support it is to add the appropriate compiler flag into the 
CMakeLists.

As for using an external build system, I'm not sure how much complexity it 
would add, but at the very least I suspect it would work out of the box if you 
only wanted to support scalar kernels. Otherwise I don't think it would add 
much more complexity than we currently have detecting architectures at 
buildtime.

Sasha

On Tue, Mar 29, 2022 at 3:26 PM Micah Kornfield wrote:

> Hi Sasha,
> Could you elaborate on the problems of the XSIMD dependency?  What you
> describe sounds a lot like what XSIMD provides in a prepackaged form
> and without the extra CMake magic.
>
> I  have to occasionally build Arrow with an external build system and
> it sounds like this type of logic could add complexity there.
>
> Thanks,
> Micah
>
> > On Tue, Mar 29, 2022 at 3:14 PM Sasha Krassovsky  wrote:
>
> > Hi everyone,
> > I've noticed that we include xsimd as an abstraction over all of the
> > simd architectures. I'd like to propose a different solution which
> > would
> result
> > in fewer lines of code, while being more readable.
> >
> > My thinking is that anything simple enough to abstract with xsimd
> > can be autovectorized by the compiler. Any more interesting SIMD
> > algorithm
> usually
> > is tailored to the target instruction set and can't be abstracted
> > away
> with
> > xsimd anyway.
> >
> > With that in mind, I'd like to propose the following strategy:
> > 1. Write a single source file with simple, element-at-a-time for-loop
> > implementations of each function.
> > 2. Compile this same source file several times with different compile
> > flags for different vectorization (e.g. if we're on an x86 machine that
> > supports AVX2 and AVX512, we'd compile once with -mavx2 and once with
> > -mavx512vl).
> > 3. Functions compiled with different instruction sets can be
> > differentiated by a namespace, which gets defined during the compiler
> > invocation. For example, for AVX2 we'd invoke the compiler with
> > -DNAMESPACE=AVX2 and then for something like elementwise addition of
> > two arrays, we'd call arrow::AVX2::VectorAdd.
> >
> > I believe this would let us remove xsimd as a dependency while also
> > giving us lots of vectorized kernels at the cost of some extra cmake
> > magic. After that, it would just be a matter of making the function
> > registry point to these new functions.
> >
> > Please let me know your thoughts!
> >
> > Thanks,
> > Sasha Krassovsky
> >
>
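
For concreteness, a rough sketch of the multi-compilation approach proposed
above (the file and namespace names are hypothetical, not from the thread):

# Compile the same scalar source once per target ISA. The NAMESPACE macro
# selects the enclosing C++ namespace, so each object file exports its own
# copy of the kernels, e.g. arrow::AVX2::VectorAdd.
g++ -O3 -DNAMESPACE=Scalar            -c kernels.cc -o kernels_scalar.o
g++ -O3 -DNAMESPACE=AVX2   -mavx2     -c kernels.cc -o kernels_avx2.o
g++ -O3 -DNAMESPACE=AVX512 -mavx512vl -c kernels.cc -o kernels_avx512.o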

Re: Arm64 github runner

2022-02-14 Thread Yibo Cai

Unfortunately the runner may not be available. Will update if things change.

Yibo

On 11/12/21 6:57 PM, Krisztián Szűcs wrote:

On Wed, Nov 10, 2021 at 2:55 AM Yibo Cai  wrote:


Some updates, @kou, @kszucs

There are two kinds of runners provided. One is dynamic vm created on
demand like travis, suitable for github action runner to verify pr.
Another kind is static vm with pre-deployed os, simply an arm64 aws
cloud instance. Looks we prefer static vm as a crossbow runner.


Correct. I'm not aware of any way to deploy a github-actions runner to
ephemeral machines on demand.


Please note that, for security reasons, the runner cannot be accessed directly
from the internet. Inbound connection requests are rejected. It can initiate
connections to other hosts. Is that okay for crossbow?


Shouldn't be a problem, see
https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#communication-between-self-hosted-runners-and-github


Any security concern from crossbow side?


Crossbow can only be triggered by trusted contributors, so with
reasonable precautions we can maintain a low risk setup.



On 10/22/21 10:07 AM, Yibo Cai wrote:

Thanks. I'm applying for the runner. Will update when ready.

On 10/22/21 6:31 AM, Krisztián Szűcs wrote:

On Thu, Oct 21, 2021 at 11:53 PM Sutou Kouhei  wrote:


Hi,

It's useful!

We have two options to use this:

 1. Use this on https://github.com/apache/arrow
 2. Use this on https://github.com/ursacomputing/crossbow/

1. is for CI of each commit/pull request.

2. is for CI of daily build and "@github-actions crossbow
submit ..." comment in pull request.

(We can choose both of them.)

If we choose 1., we need to ask INFRA for adding new
self-hosted runners. Because we don't have admin permission
of apache/arrow. Could you show a URL how to use the
self-hosted runners?

If we choose 2., we will able to do all needed work by
ourselves because we have admin permission of
ursacomputing/crossbow/.

We already have a self-hosted runner configured for crossbow where we
build the Apple M1 wheels.

I think we should start to configure the new runners for crossbow and
work out the details, and then later (if we choose to) get the
required registration tokens from INFRA.



Thanks,
--
kou

In <2a0074c7-562c-3a41-8763-6e1d4ac17...@arm.com>
 "Arm64 github runner" on Wed, 20 Oct 2021 12:43:16 +0800,
 Yibo Cai  wrote:


Hi,

We have free Arm64 instances (maintained by Arm) as github action
self-hosted runners for open source projects.
Arrow Arm CI is currently running on Travis. Is an additional Arm64
runner useful? I think we can build and verify Arm64 Linux releases on
it.

Yibo
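
For reference, registering such a machine as a self-hosted runner boils down
to the standard GitHub Actions runner setup (the repository URL is an example;
the registration token comes from the repository's Actions settings page):

# On the Arm64 host, inside the unpacked actions-runner directory:
./config.sh --url https://github.com/ursacomputing/crossbow --token <REGISTRATION_TOKEN>
./run.sh   # start listening for jobs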


RE: [VOTE] Release Apache Arrow 7.0.0 - RC10

2022-01-29 Thread Yibo Cai
+1.

Verified C++ and Python source on Arm64 ubuntu20.04.
CC=clang-12 CXX=clang++-12 TEST_SOURCE=1 TEST_DEFAULT=0 TEST_CPP=1 \
TEST_PYTHON=1 dev/release/verify-release-candidate.sh source 7.0.0 10

-Original Message-
From: Krisztián Szűcs 
Sent: Saturday, January 29, 2022 7:29 PM
To: dev 
Subject: [VOTE] Release Apache Arrow 7.0.0 - RC10

Hi,

I would like to propose the following release candidate (RC10) of Apache Arrow 
version 7.0.0. This is a release consisting of 649 resolved JIRA issues[1].

This release candidate is based on commit:
e90472e35b40f58b17d408438bb8de1641bfe6ef [2]

The source release rc10 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests, and vote 
on the release. See [13] for how to validate a release candidate.

The crossbow release verification tasks' results are available at [14].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 7.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 7.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%207.0.0
[2]: 
https://github.com/apache/arrow/tree/e90472e35b40f58b17d408438bb8de1641bfe6ef
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-7.0.0-rc10
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/7.0.0-rc10
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/7.0.0-rc10
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/7.0.0-rc10
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/e90472e35b40f58b17d408438bb8de1641bfe6ef/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[14]: https://github.com/apache/arrow/pull/12293


RE: [VOTE] Release Apache Arrow 7.0.0 - RC8

2022-01-28 Thread Yibo Cai
A Gandiva unit test failed on Arm (TEST_SOURCE=1 TEST_CPP=1), but the bug is not
arch-dependent and should be fixed in the 7.0 release.
Details at https://issues.apache.org/jira/browse/ARROW-15493; a PR is ready.

-Original Message-
From: Krisztián Szűcs 
Sent: Wednesday, January 26, 2022 9:24 PM
To: dev 
Subject: [VOTE] Release Apache Arrow 7.0.0 - RC8

Hi,

I would like to propose the following release candidate (RC8) of Apache Arrow 
version 7.0.0. This is a release consisting of 618 resolved JIRA issues[1].

This release candidate is based on commit:
400b5d989dd3a654bc1061d19a5ae3e95972e5eb [2]

The source release rc8 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests, and vote 
on the release. See [13] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 7.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 7.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%207.0.0
[2]: 
https://github.com/apache/arrow/tree/400b5d989dd3a654bc1061d19a5ae3e95972e5eb
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-7.0.0-rc8
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/7.0.0-rc8
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/7.0.0-rc8
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/7.0.0-rc8
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/400b5d989dd3a654bc1061d19a5ae3e95972e5eb/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


RE: [VOTE] Release Apache Arrow 7.0.0 - RC8

2022-01-27 Thread Yibo Cai
The BitUtilTests.TestCopyAndReverseBitmapPreAllocated test failure is tracked at
https://issues.apache.org/jira/browse/ARROW-15461
It's due to a clang-12 compiler bug. PR
https://github.com/apache/arrow/pull/12276 fixes the issue.

-Original Message-
From: Jonathan Keane 
Sent: Friday, January 28, 2022 10:27 AM
To: dev@arrow.apache.org
Subject: Re: [VOTE] Release Apache Arrow 7.0.0 - RC8

+0. Most things validate, though I haven't been able to run the C++
tests successfully.

Thank you for the huge effort Krisztián.

I verified the signature + checksums on [3].

I've run the following (on macOS 12.1):

The binary verification — successful.

I've also run the source verification on:
* C++ — 1 test failure: BitUtilTests.TestCopyAndReverseBitmapPreAllocated.
Is this flaky? It fails each time I try C++ (or any of the packages that
depend on C++)
* JS — had to install yarn (should we add this to the release verification
instructions for macOS [13]?) but successful
* Go — successful
* csharp — complained about a dotnet version mismatch (I didn't dig too deeply
on this one)


[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-7.0.0-rc8
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates

-Jon






On Wed, Jan 26, 2022 at 7:24 AM Krisztián Szűcs  
wrote:
>
> Hi,
>
> I would like to propose the following release candidate (RC8) of
> Apache Arrow version 7.0.0. This is a release consisting of 618
> resolved JIRA issues[1].
>
> This release candidate is based on commit:
> 400b5d989dd3a654bc1061d19a5ae3e95972e5eb [2]
>
> The source release rc8 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> The changelog is located at [12].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [13] for how to validate a release candidate.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 7.0.0 [ ] +0 [ ] -1 Do not release
> this as Apache Arrow 7.0.0 because...
>
> [1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%207.0.0
> [2]: https://github.com/apache/arrow/tree/400b5d989dd3a654bc1061d19a5ae3e95972e5eb
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-7.0.0-rc8
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/7.0.0-rc8
> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/7.0.0-rc8
> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/7.0.0-rc8
> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [12]: https://github.com/apache/arrow/blob/400b5d989dd3a654bc1061d19a5ae3e95972e5eb/CHANGELOG.md
> [13]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


RE: [VOTE] Release Apache Arrow 7.0.0 - RC6

2022-01-26 Thread Yibo Cai
arrow-utility-test fails on both x86 and Arm when verifying Arrow from source 
with clang-12.
I believe it's a compiler bug and not a blocking issue.
Details at https://issues.apache.org/jira/browse/ARROW-15461

-Original Message-
From: Krisztián Szűcs 
Sent: Tuesday, January 25, 2022 2:03 AM
To: dev 
Subject: [VOTE] Release Apache Arrow 7.0.0 - RC6

Hi,

I would like to propose the following release candidate (RC6) of Apache Arrow 
version 7.0.0. This is a release consisting of 608 resolved JIRA issues[1].

This release candidate is based on commit:
cc809bd98a04f562a38107858cab669db0768cc1 [2]

The source release rc6 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests, and vote 
on the release. See [13] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 7.0.0 [ ] +0 [ ] -1 Do not release this as 
Apache Arrow 7.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%207.0.0
[2]: 
https://github.com/apache/arrow/tree/cc809bd98a04f562a38107858cab669db0768cc1
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-7.0.0-rc6
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/7.0.0-rc6
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/7.0.0-rc6
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/7.0.0-rc6
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/cc809bd98a04f562a38107858cab669db0768cc1/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


Re: Arrow in HPC

2022-01-18 Thread Yibo Cai

Some updates.

I tested David's UCX transport patch over a 100Gb network. FlightRPC over 
UCX/RDMA improves throughput by about 50%, with lower and flatter latency.

And I think there is room to improve further. See the test report [1].

For the data plane approach, the PoC shared memory data plane also 
delivers a significant performance boost. Details at [2].


Glad to see there is big potential to improve FlightRPC performance.

[1] https://issues.apache.org/jira/browse/ARROW-15229
[2] https://issues.apache.org/jira/browse/ARROW-15282

On 12/30/21 11:57 PM, David Li wrote:

Ah, I see.

I think both projects can proceed as well. At some point we will have to figure 
out how to merge them, but I think it's too early to see how exactly we will 
want to refactor things.

I looked over the code and I don't have any important comments for now. Looking 
forward to reviewing when it's ready.

-David

On Wed, Dec 29, 2021, at 22:16, Yibo Cai wrote:



On 12/29/21 11:03 PM, David Li wrote:

Awesome, thanks for sharing this too!

The refactoring you have with DataClientStream is what I would like to do as well 
- I think much of the existing code can be adapted to be more 
transport-agnostic, and then it will be easier to support new transports 
(whether data-only or for all methods).

Where do you see the gaps between gRPC and this? I think what would happen is 
1) client calls GetFlightInfo 2) server returns a `shm://` URI 3) client sees 
the unfamiliar prefix and creates a new client for the DoGet call (it would 
have to do this anyways if, for instance, the GetFlightInfo call returned the 
address of a different server).
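
A minimal sketch of that scheme-based dispatch, assuming a hypothetical
DataPlaneClient interface; ShmClient, GrpcClient and ConnectForEndpoint are
illustrative names, not actual Arrow Flight APIs:

```
#include <iostream>
#include <memory>
#include <string>

// Illustrative transport interface; not an actual Arrow Flight API.
struct DataPlaneClient {
  virtual ~DataPlaneClient() = default;
  virtual void DoGet(const std::string& ticket) = 0;
};

struct ShmClient : DataPlaneClient {
  void DoGet(const std::string& ticket) override {
    std::cout << "DoGet over shared memory: " << ticket << "\n";
  }
};

struct GrpcClient : DataPlaneClient {
  void DoGet(const std::string& ticket) override {
    std::cout << "DoGet over gRPC: " << ticket << "\n";
  }
};

// Step 3 above: inspect the URI scheme returned by GetFlightInfo and
// create a matching client for the DoGet call.
std::unique_ptr<DataPlaneClient> ConnectForEndpoint(const std::string& uri) {
  if (uri.rfind("shm://", 0) == 0) return std::make_unique<ShmClient>();
  return std::make_unique<GrpcClient>();  // default for unknown schemes
}

int main() {
  ConnectForEndpoint("shm://arrow-flight-0")->DoGet("ticket-42");
  ConnectForEndpoint("grpc://host:31337")->DoGet("ticket-42");
}
```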



I mean implementation details. Some unit tests run longer than expected
(the data plane times out reading from an ended stream). It looks like the grpc
stream finish message is not correctly intercepted and forwarded to the data plane.
I don't think it's a big problem, it just needs some time to debug.


I also wonder how this stacks up to UCX's shared memory backend (I did not test 
this though).



I implemented a shared memory data plane only to verify and consolidate
the data plane design, as it's the easiest (and still useful) driver. I also
plan to implement a socket-based data plane, not useful in practice,
only to make sure the design works okay across a network. Then we can add
more useful drivers like UCX or DPDK (the benefit of DPDK is that it works on
commodity hardware, unlike UCX/RDMA which requires expensive equipment).


I would like to be able to support entirely new transports for certain cases 
(namely browser support - though perhaps one of the gRPC proxies would suffice 
there), but even in that case, we could make it so that a new transport only 
needs to implement the data plane methods. Only having to support the data 
plane methods would save significant implementation effort for all non-browser 
cases, so I think it's a worthwhile approach.



Thanks for being interested in this approach. My current plan is to first
refactor the shared memory data plane to verify it beats grpc in local rpc
by a considerable margin; otherwise there must be big mistakes in my
design. After that I will fix the unit test issues and deliver it for community
review.

Anyway, don't let me block your implementations. And if you think it's
useful, I can push the current code for more detailed discussion.


-David

On Wed, Dec 29, 2021, at 04:37, Yibo Cai wrote:

Thanks David for initiating the UCX integration, great work!
I think a 5Gbps network is too limited for performance evaluation. I will try the 
patch on a 100Gb RDMA network, hopefully we can see some improvements.
I once benchmarked flight over a 100Gb network [1]: grpc-based throughput is 
2.4GB/s for one thread, 8.8GB/s for six threads, with about 60us latency. I also 
benchmarked raw RDMA performance (same batch sizes as flight): one thread can 
achieve 9GB/s with 12us latency. Of course the comparison is not fair. With 
David's patch, we can get a more realistic comparison.

I'm implementing a data plane approach in the hope that we can adopt new data 
acceleration methods easily. My approach is to replace only the FlightData 
transmission of DoGet/Put/Exchange with data plane drivers; grpc is still 
used for all rpc calls.
Code is at my github repo [2]. Besides the framework, I just implemented a 
shared memory data plane driver as a PoC. Get/Put/Exchange unit tests pass; 
TestCancel hangs and some unit tests run longer than expected, still debugging. 
The shared memory data plane performance is pretty bad now, due to repeated 
map/unmap for each read/write; pre-allocated pages should improve it much, still 
experimenting.
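
To illustrate the pre-allocated pages idea, a minimal POSIX sketch that maps
the shared region once at startup and then reuses it for every message,
keeping syscalls off the hot path; the region name and size are made up for
the example:

```
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

constexpr std::size_t kRegionSize = 1 << 20;  // 1 MiB, illustrative

// Map the shared region once; the mapping stays valid after close(fd).
void* MapOnce() {
  int fd = shm_open("/arrow_dp", O_CREAT | O_RDWR, 0600);
  if (fd < 0) return nullptr;
  if (ftruncate(fd, kRegionSize) != 0) { close(fd); return nullptr; }
  void* base = mmap(nullptr, kRegionSize, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
  close(fd);
  return base == MAP_FAILED ? nullptr : base;
}

// Writing a message is now just a memcpy into the long-lived mapping;
// no mmap/munmap on the hot path.
void WriteMessage(void* base, const void* data, std::size_t n) {
  std::memcpy(base, data, n);
}
```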

Would like to hear community comments.

My personal opinion is that the data plane approach reuses the grpc control plane 
and may make it easier to add new data acceleration methods, but it needs to fit 
into grpc seamlessly (there are still unresolved gaps). A new transport requires 
much more initial effort, but may pay off later. And it looks like these two 
approaches don't conflict with each other.

[1] Test environment
nics

Re: Arrow in HPC

2021-12-29 Thread Yibo Cai




On 12/29/21 11:03 PM, David Li wrote:

Awesome, thanks for sharing this too!

The refactoring you have with DataClientStream is what I would like to do as well 
- I think much of the existing code can be adapted to be more 
transport-agnostic, and then it will be easier to support new transports 
(whether data-only or for all methods).

Where do you see the gaps between gRPC and this? I think what would happen is 
1) client calls GetFlightInfo 2) server returns a `shm://` URI 3) client sees 
the unfamiliar prefix and creates a new client for the DoGet call (it would 
have to do this anyways if, for instance, the GetFlightInfo call returned the 
address of a different server).



I mean implementation details. Some unit tests run longer than expected 
(the data plane times out reading from an ended stream). It looks like the grpc 
stream finish message is not correctly intercepted and forwarded to the data plane. 
I don't think it's a big problem, it just needs some time to debug.



I also wonder how this stacks up to UCX's shared memory backend (I did not test 
this though).



I implemented a shared memory data plane only to verify and consolidate 
the data plane design, as it's the easiest (and still useful) driver. I also 
plan to implement a socket-based data plane, not useful in practice, 
only to make sure the design works okay across a network. Then we can add 
more useful drivers like UCX or DPDK (the benefit of DPDK is that it works on 
commodity hardware, unlike UCX/RDMA which requires expensive equipment).



I would like to be able to support entirely new transports for certain cases 
(namely browser support - though perhaps one of the gRPC proxies would suffice 
there), but even in that case, we could make it so that a new transport only 
needs to implement the data plane methods. Only having to support the data 
plane methods would save significant implementation effort for all non-browser 
cases, so I think it's a worthwhile approach.



Thanks for being interested in this approach. My current plan is to first 
refactor the shared memory data plane to verify it beats grpc in local rpc 
by a considerable margin; otherwise there must be big mistakes in my 
design. After that I will fix the unit test issues and deliver it for community 
review.


Anyway, don't let me block your implementations. And if you think it's 
useful, I can push the current code for more detailed discussion.



-David

On Wed, Dec 29, 2021, at 04:37, Yibo Cai wrote:

Thanks David for initiating the UCX integration, great work!
I think a 5Gbps network is too limited for performance evaluation. I will try the 
patch on a 100Gb RDMA network, hopefully we can see some improvements.
I once benchmarked flight over a 100Gb network [1]: grpc-based throughput is 
2.4GB/s for one thread, 8.8GB/s for six threads, with about 60us latency. I also 
benchmarked raw RDMA performance (same batch sizes as flight): one thread can 
achieve 9GB/s with 12us latency. Of course the comparison is not fair. With 
David's patch, we can get a more realistic comparison.

I'm implementing a data plane approach in the hope that we can adopt new data 
acceleration methods easily. My approach is to replace only the FlightData 
transmission of DoGet/Put/Exchange with data plane drivers; grpc is still 
used for all rpc calls.
Code is at my github repo [2]. Besides the framework, I just implemented a 
shared memory data plane driver as a PoC. Get/Put/Exchange unit tests pass; 
TestCancel hangs and some unit tests run longer than expected, still debugging. 
The shared memory data plane performance is pretty bad now, due to repeated 
map/unmap for each read/write; pre-allocated pages should improve it much, still 
experimenting.

Would like to hear community comments.

My personal opinion is that the data plane approach reuses the grpc control plane 
and may make it easier to add new data acceleration methods, but it needs to fit 
into grpc seamlessly (there are still unresolved gaps). A new transport requires 
much more initial effort, but may pay off later. And it looks like these two 
approaches don't conflict with each other.

[1] Test environment
nics: mellanox connectx5
hosts: client (neoverse n1), server (xeon gold 5218)
os: ubuntu 20.04, linux kernel 5.4
test case: 128k batch size, DoGet

[2] https://github.com/cyb70289/arrow/tree/flight-data-plane


From: David Li 
Sent: Wednesday, December 29, 2021 3:09 AM
To: dev@arrow.apache.org 
Subject: Re: Arrow in HPC

I ended up drafting an implementation of Flight based on UCX, and doing some
of the necessary refactoring to support additional backends in the future.
It can run the Flight benchmark, and performance is about comparable to
gRPC, as tested on AWS EC2.

The implementation is based on the UCP streams API. It's extremely
bare-bones and is really only a proof of concept; a good amount of work is
needed to turn it into a usable implementation. I had hoped it would perform
markedly better than gRPC, at least in this early test, but this seems not
to be the case. That said: I am likely

Re: Arrow in HPC

2021-12-29 Thread Yibo Cai
Thanks David for initiating the UCX integration, great work!
I think a 5Gbps network is too limited for performance evaluation. I will try the 
patch on a 100Gb RDMA network, hopefully we can see some improvements.
I once benchmarked flight over a 100Gb network [1]: grpc-based throughput is 
2.4GB/s for one thread, 8.8GB/s for six threads, with about 60us latency. I also 
benchmarked raw RDMA performance (same batch sizes as flight): one thread can 
achieve 9GB/s with 12us latency. Of course the comparison is not fair. With 
David's patch, we can get a more realistic comparison.

I'm implementing a data plane approach in the hope that we can adopt new data 
acceleration methods easily. My approach is to replace only the FlightData 
transmission of DoGet/Put/Exchange with data plane drivers; grpc is still 
used for all rpc calls.
Code is at my github repo [2]. Besides the framework, I just implemented a 
shared memory data plane driver as a PoC. Get/Put/Exchange unit tests pass; 
TestCancel hangs and some unit tests run longer than expected, still debugging. 
The shared memory data plane performance is pretty bad now, due to repeated 
map/unmap for each read/write; pre-allocated pages should improve it much, still 
experimenting.

Would like to hear community comments.

My personal opinion is that the data plane approach reuses the grpc control plane 
and may make it easier to add new data acceleration methods, but it needs to fit 
into grpc seamlessly (there are still unresolved gaps). A new transport requires 
much more initial effort, but may pay off later. And it looks like these two 
approaches don't conflict with each other.

[1] Test environment
nics: mellanox connectx5
hosts: client (neoverse n1), server (xeon gold 5218)
os: ubuntu 20.04, linux kernel 5.4
test case: 128k batch size, DoGet

[2] https://github.com/cyb70289/arrow/tree/flight-data-plane


From: David Li 
Sent: Wednesday, December 29, 2021 3:09 AM
To: dev@arrow.apache.org 
Subject: Re: Arrow in HPC

I ended up drafting an implementation of Flight based on UCX, and doing some
of the necessary refactoring to support additional backends in the future.
It can run the Flight benchmark, and performance is about comparable to
gRPC, as tested on AWS EC2.

The implementation is based on the UCP streams API. It's extremely
bare-bones and is really only a proof of concept; a good amount of work is
needed to turn it into a usable implementation. I had hoped it would perform
markedly better than gRPC, at least in this early test, but this seems not
to be the case. That said: I am likely not using UCX properly, UCX would
still open up support for additional hardware, and this work should allow
other backends to be implemented more easily.

The branch can be viewed at
https://github.com/lidavidm/arrow/tree/flight-ucx

I've attached the benchmark output at the end.

There are still quite a few TODOs and things that need investigating:

- Only DoGet and GetFlightInfo are implemented, and incompletely at that.
- Concurrent requests are not supported, nor even making more than one
  request on a connection, nor does the server support concurrent clients.
  We also need to decide whether to even support concurrent requests, and
  how (e.g. pooling multiple connections, or implementing a gRPC/HTTP2 style
  protocol, or even possibly implementing HTTP2).
- We need to make sure we properly handle errors, etc. everywhere.
- Are we using UCX in a performant and idiomatic manner? Will the
  implementation work well on RDMA and other specialized hardware?
- Do we also need to support the UCX tag API?
- Can we refactor out interfaces that allow sharing more of the
  client/server implementation between different backends?
- Are the abstractions sufficient to support other potential backends like
  MPI, libfabrics, or WebSockets?

If anyone has experience with UCX, I'd appreciate any feedback. Otherwise,
I'm hoping to plan out and try to tackle some of the TODOs above, and figure
out how this effort can proceed.

Antoine/Micah raised the possibility of extending gRPC instead. That would
be preferable, frankly, given that otherwise we might have to re-implement a
lot of what gRPC and HTTP2 provide by ourselves. However, the necessary
proposal stalled and was dropped without much discussion:
https://groups.google.com/g/grpc-io/c/oIbBfPVO0lY

Benchmark results (also uploaded at
https://gist.github.com/lidavidm/c4676c5d9c89d4cc717d6dea07dee952):

Testing was done between two t3.xlarge instances in the same zone.
t3.xlarge has "up to 5 Gbps" of bandwidth (~600 MiB/s).

(ucx) ubuntu@ip-172-31-37-78:~/arrow/build$ env UCX_LOG_LEVEL=info 
./relwithdebinfo/arrow-flight-benchmark -transport ucx -server_host 172.31.34.4 
-num_streams=1 -num_threads=1 -records_per_stream=4096 
-records_per_batch=4096
Testing method: DoGet
[1640703417.639373] [ip-172-31-37-78:10110:0] ucp_worker.c:1627 UCX  INFO  
ep_cfg[1]: tag(tcp/ens5); stream(tcp/ens5);
[1640703417.650068] [ip-172-31-37-78:10110:1] 

Re: [VOTE] Release Apache Arrow 6.0.1 - RC1

2021-11-11 Thread Yibo Cai

+1.

Verified c++ and python source, on ubuntu 20.04, aarch64.

CC=clang-10 CXX=clang++-10 \
TEST_SOURCE=1 TEST_DEFAULT=0 TEST_CPP=1 TEST_PYTHON=1 \
dev/release/verify-release-candidate.sh source 6.0.1 1


On 11/11/21 10:39 AM, Sutou Kouhei wrote:

Hi,

I would like to propose the following release candidate (RC1) of Apache
Arrow version 6.0.1. This is a release consisting of 29
resolved JIRA issues[1].

This release candidate is based on commit:
347a88ff9d20e2a4061eec0b455b8ea1aa8335dc [2]

The source release rc1 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

See also verification results by GitHub Actions:

   https://github.com/apache/arrow/pull/11671

There are some known failures:

   * verify-rc-source-integration-linux-amd64
   * verify-rc-source-python-macos-arm64
   * verify-rc-wheels-macos-11-amd64
   * verify-rc-wheels-macos-11-arm64

Except for verify-rc-source-integration-linux-amd64, they
also failed with 6.0.0 RC3:

   https://github.com/apache/arrow/pull/11511

Here is the verify-rc-source-integration-linux-amd64 log:

   
https://github.com/ursacomputing/crossbow/runs/4172486523?check_suite_focus=true

I'm not sure whether this is a blocker or not.

Note that the verification passed on my local machine.


The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 6.0.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow 6.0.1 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%206.0.1
[2]: 
https://github.com/apache/arrow/tree/347a88ff9d20e2a4061eec0b455b8ea1aa8335dc
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-6.0.1-rc1
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/6.0.1-rc1
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/6.0.1-rc1
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/6.0.1-rc1
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/347a88ff9d20e2a4061eec0b455b8ea1aa8335dc/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates



Re: Arm64 github runner

2021-11-09 Thread Yibo Cai

Some updates, @kou, @kszucs

There are two kinds of runners provided. One is a dynamic VM created on 
demand like travis, suitable as a github action runner to verify PRs.
The other is a static VM with a pre-deployed OS, simply an arm64 aws 
cloud instance. It looks like we prefer the static VM as a crossbow runner.


Please note that for security reasons the runner cannot be accessed directly 
from the internet: inbound connection requests are rejected, but it can initiate 
connections to other hosts. Is that okay for crossbow? Any security 
concerns from the crossbow side?


On 10/22/21 10:07 AM, Yibo Cai wrote:

Thanks. I'm applying for the runner. Will update when ready.

On 10/22/21 6:31 AM, Krisztián Szűcs wrote:

On Thu, Oct 21, 2021 at 11:53 PM Sutou Kouhei  wrote:


Hi,

It's useful!

We have two options to use this:

1. Use this on https://github.com/apache/arrow
2. Use this on https://github.com/ursacomputing/crossbow/

1. is for CI of each commit/pull request.

2. is for CI of daily build and "@github-actions crossbow
submit ..." comment in pull request.

(We can choose both of them.)

If we choose 1., we need to ask INFRA for adding new
self-hosted runners. Because we don't have admin permission
of apache/arrow. Could you show a URL how to use the
self-hosted runners?

If we choose 2., we will be able to do all the needed work by
ourselves because we have admin permission on
ursacomputing/crossbow/.

We already have a self-hosted runner configured for crossbow where we
build the Apple M1 wheels.

I think we should start to configure the new runners for crossbow and
work out the details, and then later (if we choose to) get the
required registration tokens from INFRA.



Thanks,
--
kou

In <2a0074c7-562c-3a41-8763-6e1d4ac17...@arm.com>
"Arm64 github runner" on Wed, 20 Oct 2021 12:43:16 +0800,
Yibo Cai  wrote:


Hi,

We have free Arm64 instances (maintained by Arm) as github action
self-hosted runners for open source projects.
Arrow Arm CI is currently running on Travis. Is an additional Arm64
runner useful? I think we can build and verify Arm64 Linux releases on
it.

Yibo


Re: Arrow in HPC

2021-10-26 Thread Yibo Cai



On 10/26/21 10:02 PM, David Li wrote:

Hi Yibo,

Just curious, has there been more thought on this from your/the HPC side?


Yes. I will investigate the possible approaches. Maybe I'll build a quick (and 
dirty) PoC test first.




I also realized we never asked, what is motivating Flight in this space in the 
first place? Presumably broader Arrow support in general?


No special reason. It will be great if it comes up with something useful, or 
it will be an interesting experiment otherwise.




-David

On Fri, Sep 10, 2021, at 12:27, Micah Kornfield wrote:


I would support doing the work necessary to get UCX (or really any other
transport) supported, even if it is a lot of work. (I'm hoping this clears
the path to supporting a Flight-to-browser transport as well; a few
projects seem to have rolled their own approaches but I think Flight itself
should really handle this, too.)



Another possible technical approach is to investigate coming up
with a custom gRPC "channel" implementation for new transports.
Searching around, it seems like there were some defunct PRs trying to
enable UCX as one; I didn't look closely enough at why they might have
failed.

On Thu, Sep 9, 2021 at 11:07 AM David Li  wrote:


I would support doing the work necessary to get UCX (or really any other
transport) supported, even if it is a lot of work. (I'm hoping this clears
the path to supporting a Flight-to-browser transport as well; a few
projects seem to have rolled their own approaches but I think Flight itself
should really handle this, too.)

 From what I understand, you could tunnel gRPC over UCX as Keith mentions,
or directly use UCX, which is what it sounds like you are thinking about.
One idea we had previously was to stick to gRPC for 'control plane'
methods, and support alternate protocols only for 'data plane' methods like
DoGet - this might be more manageable, depending on what you have in mind.

In general - there's quite a bit of work here, so it would help to
separate the work into phases, and share some more detailed
design/implementation plans, to make review more manageable. (I realize of
course this is just a general interest check right now.) Just splitting
gRPC/Flight is going to take a decent amount of work, and (from what little
I understand) using UCX means choosing from various communication methods
it offers and writing a decent amount of scaffolding code, so it would be
good to establish what exactly a 'UCX' transport means. (For instance,
presumably there's no need to stick to the Protobuf-based wire format, but
what format would we use?)

It would also be good to expand the benchmarks, to validate the
performance we get from UCX and have a way to compare it against gRPC.
Anecdotally I've found gRPC isn't quite able to saturate a connection so it
would be interesting to see what other transports can do.

Jed - how would you see MPI and Flight interacting? As another
transport/alternative to UCX? I admit I'm not familiar with the HPC space.

About transferring commands with data: Flight already has an app_metadata
field in various places to allow things like this, it may be interesting to
combine with the ComputeIR proposal on this mailing list, and hopefully you
& your colleagues can take a look there as well.

-David

On Thu, Sep 9, 2021, at 11:24, Jed Brown wrote:

Yibo Cai  writes:


HPC infrastructure normally leverages RDMA for fast data transfer among
storage nodes and compute nodes. Computation tasks are dispatched to
compute nodes with best fit resources.

Concretely, we are investigating porting UCX as the Flight transport layer.
UCX is a communication framework for modern networks. [1]
Besides HPC usage, many projects (spark, dask, blazingsql, etc) also
adopt UCX to accelerate network transmission. [2][3]


I'm interested in this topic and think it's important that, even if the
focus is directly on UCX, there be some thought given to MPI
interoperability and support for scalable collectives. MPI considers UCX to
be an implementation detail, but the two main implementations (MPICH and
Open MPI) support it, and vendor implementations are all derived from these
two.










Re: [VOTE] Release Apache Arrow 6.0.0 - RC3

2021-10-21 Thread Yibo Cai

+1

Verified c++/python source on ubuntu 20.04, aarch64

ARROW_CMAKE_OPTIONS="-DCMAKE_CXX_COMPILER=/usr/bin/clang++-10 
-DCMAKE_C_COMPILER=/usr/bin/clang-10" TEST_DEFAULT=0 TEST_SOURCE=1 
TEST_CPP=1 TEST_PYTHON=1 dev/release/verify-release-candidate.sh source 
6.0.0 3


On 10/22/21 7:30 AM, Krisztián Szűcs wrote:

Hi,

I would like to propose the following release candidate (RC3) of Apache
Arrow version 6.0.0. This is a release consisting of 592
resolved JIRA issues[1].

This release candidate is based on commit:
5a5f4ce326194750422ef6f053469ed1912ce69f [2]

The source release rc3 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9].
The changelog is located at [10].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [11] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 6.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 6.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%206.0.0
[2]: 
https://github.com/apache/arrow/tree/5a5f4ce326194750422ef6f053469ed1912ce69f
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-6.0.0-rc3
[4]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/6.0.0-rc3
[8]: https://apache.jfrog.io/artifactory/arrow/python-rc/6.0.0-rc3
[9]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[10]: 
https://github.com/apache/arrow/blob/5a5f4ce326194750422ef6f053469ed1912ce69f/CHANGELOG.md
[11]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates



Re: Arm64 github runner

2021-10-21 Thread Yibo Cai

Thanks. I'm applying for the runner. Will update when ready.

On 10/22/21 6:31 AM, Krisztián Szűcs wrote:

On Thu, Oct 21, 2021 at 11:53 PM Sutou Kouhei  wrote:


Hi,

It's useful!

We have two options to use this:

   1. Use this on https://github.com/apache/arrow
   2. Use this on https://github.com/ursacomputing/crossbow/

1. is for CI of each commit/pull request.

2. is for CI of daily build and "@github-actions crossbow
submit ..." comment in pull request.

(We can choose both of them.)

If we choose 1., we need to ask INFRA for adding new
self-hosted runners. Because we don't have admin permission
of apache/arrow. Could you show a URL how to use the
self-hosted runners?

If we choose 2., we will be able to do all the needed work by
ourselves because we have admin permission on
ursacomputing/crossbow/.

We already have a self-hosted runner configured for crossbow where we
build the Apple M1 wheels.

I think we should start to configure the new runners for crossbow and
work out the details, and then later (if we choose to) get the
required registration tokens from INFRA.



Thanks,
--
kou

In <2a0074c7-562c-3a41-8763-6e1d4ac17...@arm.com>
   "Arm64 github runner" on Wed, 20 Oct 2021 12:43:16 +0800,
   Yibo Cai  wrote:


Hi,

We have free Arm64 instances (maintained by Arm) as github action
self-hosted runners for open source projects.
Arrow Arm CI is currently running on Travis. Is an additional Arm64
runner useful? I think we can build and verify Arm64 Linux releases on
it.

Yibo


Arm64 github runner

2021-10-19 Thread Yibo Cai

Hi,

We have free Arm64 instances (maintained by Arm) as github action 
self-hosted runners for open source projects.
Arrow Arm CI is currently running on Travis. Is an additional Arm64 
runner useful? I think we can build and verify Arm64 Linux releases on it.


Yibo


Arrow in HPC

2021-09-09 Thread Yibo Cai

Hi,

We have some rough ideas about applying Flight in HPC (High Performance 
Computing). We would like to hear comments.


HPC infrastructure normally leverages RDMA for fast data transfer among 
storage nodes and compute nodes. Computation tasks are dispatched to 
compute nodes with best fit resources.


Concretely, we are investigating porting UCX as the Flight transport layer. 
UCX is a communication framework for modern networks. [1]
Besides HPC usage, many projects (spark, dask, blazingsql, etc) also 
adopt UCX to accelerate network transmission. [2][3]


I see a recent discussion about decoupling Flight from gRPC. It looks like this 
is also what we should do first to adapt UCX to Flight.


Another thing is that HPC may transfer commands together with the data payload, 
to be executed by the receiving compute node. FlightSQL looks to be for a similar 
purpose, though HPC normally has more flexible computation tasks; it may 
even transfer an executable binary to run on the target. [4]


[1] https://openucx.org/documentation/
[2] https://github.com/openucx/sparkucx
[3] https://blog.dask.org/2019/06/09/ucx-dgx
[4] https://arxiv.org/pdf/2108.02253.pdf


Re: [Question] Allocations along 64 byte cache lines

2021-09-07 Thread Yibo Cai

Thanks Jorge,

I'm wondering if the 64-byte alignment requirement is for the cache or for 
SIMD registers (avx512?).


For SIMD, it looks like register-width alignment does help.
E.g., _mm_load_si128 can only load 128-bit-aligned data; it performs 
better than _mm_loadu_si128, which supports unaligned loads.


Again, be very skeptical of the benchmark :)
https://quick-bench.com/q/NxyDu89azmKJmiVxF29Ei8FybWk
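
For reference, a self-contained sketch of the two loads being compared
(SSE2, x86 only); the buffer size and values are arbitrary:

```
#include <cstdint>
#include <cstdlib>
#include <emmintrin.h>

int main() {
  // 64-byte aligned buffer; std::aligned_alloc requires the size to be a
  // multiple of the alignment.
  auto* p = static_cast<int32_t*>(std::aligned_alloc(64, 64));
  for (int i = 0; i < 16; ++i) p[i] = i;

  __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(p));       // aligned load
  __m128i u = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + 1));  // unaligned load
  // _mm_load_si128 on p + 1 would be undefined behavior (typically a fault).

  __m128i s = _mm_add_epi32(a, u);
  (void)s;
  std::free(p);
}
```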


On 9/7/21 7:16 PM, Jorge Cardoso Leitão wrote:

Thanks,

I think that the alignment requirement in IPC is different from this one:
we enforce 8/64 byte alignment when serializing for IPC, but we (only)
recommend 64 byte alignment in memory addresses (at least this is my
understanding from the above link).

I did test adding two arrays and the result is independent of the alignment
(on my machine, compiler, etc).

Yibo, thanks a lot for that example. I am unsure whether it captures the
cache alignment concept, though: in the example we are reading a long (8
bytes) from a pointer that is not aligned with 8 bytes (63 % 8 != 0), which
is both slow and often undefined behavior. I think the bench we want
is to change 63 to 64-8 (which is still not 64-byte cache aligned but is
aligned for a long); then the difference vanishes (under the same gotchas that
you mentioned) https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
Alternatively, add an int32 with an offset of 4.

I benched both with explicit (via intrinsics) SIMD and without (i.e. let
the compiler do it for us), and the alignment does not impact the benches.

Best,
Jorge

[1] https://stackoverflow.com/a/27184001/931303





On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai  wrote:


Did a quick bench of accessing a long buffer that is not 8-byte aligned. Given
the right conditions, it does show that unaligned access has some penalty
over aligned access. But I don't think this is an issue in practice.

Please be very skeptical of this benchmark. It's hard to get it right
given the complexity of the hardware, compiler, benchmark tool and environment.

https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk


On 9/7/21 7:55 AM, Micah Kornfield wrote:


My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.



This is probably true.  [1] is the original mailing list discussion.  I
think lack of measurable differences and high overhead for 64 byte
alignment was the reason for relaxing to 8 byte alignment.

Specifically, I performed two types of tests, a "random sum" where we
compute the sum of the values taken at random indices, and "sum", where we
sum all values of the array (buffer[1] of the primitive array), both for
arrays ranging from 2^10 to 2^25 elements. I was expecting that, at least in
the latter, prefetching would help, but I do not observe any difference.



The most likely place I think where this could make a difference would be
for operations on wider types (Decimal128 and Decimal256). Another place
where I think alignment could help is when adding two primitive arrays (it
sounds like this was summing a single array?).

[1]


https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E


On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou 

wrote:




On 06/09/2021 at 23:20, Jorge Cardoso Leitão wrote:

Thanks a lot Antoine for the pointers. Much appreciated!

Generally, it should not hurt to align allocations to 64 bytes anyway,

since you are generally dealing with large enough data that the
(small) memory overhead doesn't matter.


Not for performance. However, 64 byte alignment in Rust requires
maintaining a custom container, a custom allocator, and the inability to
interoperate with `std::Vec` and the ecosystem that is based on it, since
std::Vec allocates with alignment T (e.g. int32), not 64 bytes. For anyone
interested, the background for this is this old PR [1] and this in
arrow2 [2].


I see. In the C++ implementation, we are not compatible with the default
allocator either (but C++ allocators as defined by the standard library
don't support resizing, which doesn't make them terribly useful for
Arrow anyway).
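
For contrast with the Rust situation, a minimal C++17 sketch of a
64-byte-aligned allocator that plugs into std::vector; AlignedAllocator is an
illustrative name, not Arrow's actual allocator:

```
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

template <class T, std::size_t Alignment = 64>
struct AlignedAllocator {
  using value_type = T;
  AlignedAllocator() = default;
  template <class U>
  AlignedAllocator(const AlignedAllocator<U, Alignment>&) {}

  T* allocate(std::size_t n) {
    // std::aligned_alloc requires a size that is a multiple of the alignment.
    std::size_t bytes = (n * sizeof(T) + Alignment - 1) / Alignment * Alignment;
    if (void* p = std::aligned_alloc(Alignment, bytes)) return static_cast<T*>(p);
    throw std::bad_alloc();
  }
  void deallocate(T* p, std::size_t) { std::free(p); }
};

template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) {
  return true;
}
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) {
  return false;
}

int main() {
  std::vector<int32_t, AlignedAllocator<int32_t>> v(1024);
  // v.data() is 64-byte aligned, so this returns 0.
  return static_cast<int>(reinterpret_cast<std::uintptr_t>(v.data()) % 64);
}
```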


Neither myself in micro benches nor Ritchie from polars (query engine) in
large scale benches observe any difference in the archs we have available.

This is not consistent with the emphasis we put on the memory alignments
discussion [3], and I am trying to understand the root cause for this
inconsistency.


My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.


By prefetching I mean implicit; no intrinsics involved.


Well, I'm not aware that implicit prefetching depends on alignment.

Regards

Antoine.









Re: [Question] Allocations along 64 byte cache lines

2021-09-06 Thread Yibo Cai
Did a quick bench of accessing a long buffer that is not 8-byte aligned. Given 
the right conditions, it does show that unaligned access has some penalty 
over aligned access. But I don't think this is an issue in practice.


Please be very skeptical of this benchmark. It's hard to get it right 
given the complexity of the hardware, compiler, benchmark tool and environment.


https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk


On 9/7/21 7:55 AM, Micah Kornfield wrote:


My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.



This is probably true.  [1] is the original mailing list discussion.  I
think lack of measurable differences and high overhead for 64 byte
alignment was the reason for relaxing to 8 byte alignment.

Specifically, I performed two types of tests, a "random sum" where we

compute the sum of the values taken at random indices, and "sum", where we
sum all values of the array (buffer[1] of the primitive array), both for
arrays ranging from 2^10 to 2^25 elements. I was expecting that, at least in
the latter, prefetching would help, but I do not observe any difference.



The most likely place I think where this could make a difference would be
for operations on wider types (Decimal128 and Decimal256).   Another place
where I think alignment could help is when adding two primitive arrays (it
sounds like this was summing a single array?).

[1]
https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E

On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou  wrote:



On 06/09/2021 at 23:20, Jorge Cardoso Leitão wrote:

Thanks a lot Antoine for the pointers. Much appreciated!

Generally, it should not hurt to align allocations to 64 bytes anyway,

since you are generally dealing with large enough data that the
(small) memory overhead doesn't matter.


Not for performance. However, 64 byte alignment in Rust requires
maintaining a custom container, a custom allocator, and the inability to
interoperate with `std::Vec` and the ecosystem that is based on it, since
std::Vec allocates with alignment T (e.g. int32), not 64 bytes. For anyone
interested, the background for this is this old PR [1] and this in arrow2
[2].


I see. In the C++ implementation, we are not compatible with the default
allocator either (but C++ allocators as defined by the standard library
don't support resizing, which doesn't make them terribly useful for
Arrow anyway).


Neither myself in micro benches nor Ritchie from polars (query engine) in
large scale benches observe any difference in the archs we have

available.

This is not consistent with the emphasis we put on the memory alignments
discussion [3], and I am trying to understand the root cause for this
inconsistency.


My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.


By prefetching I mean implicit; no intrinsics involved.


Well, I'm not aware that implicit prefetching depends on alignment.

Regards

Antoine.





RE: [NIGHTLY] Arrow Build Report for Job nightly-2021-08-24-0

2021-08-24 Thread Yibo Cai
Did a quick review. Listed error positions. Removed duplicated failures.

*-osx-*
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=10372=logs=cf796865-97b7-5cd1-be8e-6e00ce4fd8cf=9f7de14c-8ff0-55c4-a998-d852f888262c=15

test-conda-python-3.6-pandas-0.23 (I remember there's a PR to fix the s3fs 
issue)
https://github.com/ursacomputing/crossbow/runs/3407983718#step:7:8039

test-ubuntu-20.04-cpp-14
https://github.com/ursacomputing/crossbow/runs/3407972368#step:7:2484

test-conda-python-3.7-turbodbc-master
https://github.com/ursacomputing/crossbow/runs/3408001481#step:7:4473

test-ubuntu-18.04-cpp-static
https://github.com/ursacomputing/crossbow/runs/3408007168#step:7:4730

test-ubuntu-18.04-r-sanitizer
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=10365=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=3668



-Original Message-
From: Neal Richardson 
Sent: Tuesday, August 24, 2021 21:27
To: bui...@arrow.apache.org; dev 
Subject: Re: [NIGHTLY] Arrow Build Report for Job nightly-2021-08-24-0

Lots of failed builds here, and many have been failing for quite some time.
Has anyone triaged these?

Does having a separate builds@ mailing list reduce the visibility too much?

Neal

On Tue, Aug 24, 2021 at 6:03 AM Crossbow  wrote:

>
> Arrow Build Report for Job nightly-2021-08-24-0
>
> All tasks:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0
>
> Failed Tasks:
> - conda-osx-arm64-clang-py38:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-conda-osx-arm64-clang-py38
> - conda-osx-arm64-clang-py39:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-conda-osx-arm64-clang-py39
> - conda-osx-clang-py36-r40:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-conda-osx-clang-py36-r40
> - conda-osx-clang-py37-r41:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-conda-osx-clang-py37-r41
> - conda-osx-clang-py38:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-conda-osx-clang-py38
> - conda-osx-clang-py39:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-conda-osx-clang-py39
> - homebrew-r-autobrew:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-homebrew-r-autobrew
> - test-conda-python-3.6-pandas-0.23:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-test-conda-python-3.6-pandas-0.23
> - test-conda-python-3.7-pandas-0.24:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-test-conda-python-3.7-pandas-0.24
> - test-conda-python-3.7-turbodbc-latest:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-test-conda-python-3.7-turbodbc-latest
> - test-conda-python-3.7-turbodbc-master:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-test-conda-python-3.7-turbodbc-master
> - test-conda-python-3.8-spark-master:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-test-conda-python-3.8-spark-master
> - test-ubuntu-18.04-cpp-static:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-test-ubuntu-18.04-cpp-static
> - test-ubuntu-18.04-r-sanitizer:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-test-ubuntu-18.04-r-sanitizer
> - test-ubuntu-20.04-cpp-14:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-test-ubuntu-20.04-cpp-14
> - test-ubuntu-20.04-cpp-17:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-test-ubuntu-20.04-cpp-17
>
> Succeeded Tasks:
> - centos-7-amd64:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-centos-7-amd64
> - centos-8-amd64:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-github-centos-8-amd64
> - centos-8-arm64:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-travis-centos-8-arm64
> - conda-clean:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-conda-clean
> - conda-linux-gcc-py36-arm64:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-conda-linux-gcc-py36-arm64
> - conda-linux-gcc-py36-cpu-r40:
>   URL:
> https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-08-24-0-azure-conda-linux-gcc-py36-cpu-r40
> - conda-linux-gcc-py36-cuda:
>   URL:
> 

Re: [VOTE] Release Apache Arrow 5.0.0 - RC1

2021-07-24 Thread Yibo Cai

+1

Verified C++ and Python on Arm64 Linux (Ubuntu 20.04, aarch64).

ARROW_CMAKE_OPTIONS="-DCMAKE_CXX_COMPILER=/usr/bin/clang++-10 
-DCMAKE_C_COMPILER=/usr/bin/clang-10" TEST_DEFAULT=0 TEST_SOURCE=1 
TEST_CPP=1 TEST_PYTHON=1 dev/release/verify-release-candidate.sh source 
5.0.0 1


On 7/23/21 11:25 AM, Krisztián Szűcs wrote:

Hi,

I would like to propose the following release candidate (RC1) of Apache
Arrow version 5.0.0. This is a release consisting of 551
resolved JIRA issues[1].

This release candidate is based on commit:
4591d76fce2846a29dac33bf01e9ba0337b118e9 [2]

The source release rc1 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9].
The changelog is located at [10].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [11] for how to validate a release candidate.

Note, please use [12] to verify the Amazon Linux and CentOS packages.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 5.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 5.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%205.0.0
[2]: 
https://github.com/apache/arrow/tree/4591d76fce2846a29dac33bf01e9ba0337b118e9
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-5.0.0-rc1
[4]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/5.0.0-rc1
[8]: https://apache.jfrog.io/artifactory/arrow/python-rc/5.0.0-rc1
[9]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[10]: 
https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/CHANGELOG.md
[11]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[12]: https://github.com/apache/arrow/pull/10786



Re: [DISCUSS][C++] Strategies for SIMD cross-compilation?

2021-07-18 Thread Yibo Cai




On 7/17/21 12:08 AM, Wes McKinney wrote:

hi folks,

I had a conversation with the developers of xsimd last week in Paris
and was made aware that they are working on a substantial refactor of
xsimd to improve its usability for cross-compilation and
dynamic-dispatch based on runtime processor capabilities. The branch
with the refactor is located here:

https://github.com/xtensor-stack/xsimd/tree/feature/xsimd-refactoring

In particular, the simd batch API is changing from

template <class T, std::size_t N>
class batch;

to

template <class T, class Arch>
class batch;

So rather than using xsimd::batch<T, N> for an AVX512 batch,
you would do xsimd::batch<T, xsimd::avx512> (or e.g.
neon/neon64 for ARM ISAs) and then access the batch size through the
batch::size static property.
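
To make the shape of the new API concrete, a hedged sketch of an
arch-parameterized kernel in the style of the refactoring branch; exact
spellings (xsimd::avx2, xsimd::neon64, load_unaligned) may differ in the
final release:

```
#include <cstddef>
#include "xsimd/xsimd.hpp"

template <class Arch>
void add_batches(const float* a, const float* b, float* out, std::size_t n) {
  using batch = xsimd::batch<float, Arch>;
  constexpr std::size_t width = batch::size;  // batch width now derives from the arch
  std::size_t i = 0;
  for (; i + width <= n; i += width) {
    auto va = batch::load_unaligned(a + i);
    auto vb = batch::load_unaligned(b + i);
    (va + vb).store_unaligned(out + i);
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}

// Instantiate per ISA for runtime dispatch, e.g.:
//   add_batches<xsimd::avx2>(a, b, out, n);
//   add_batches<xsimd::neon64>(a, b, out, n);
```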


Adding this 'arch' parameter is a bit strange at first glance, given that the 
purpose of a SIMD wrapper is to hide arch-dependent code.
But as the latest SIMD ISAs (sve, avx512) have much richer features than 
simply widening the data width, it looks like arch-specific code is a must.

I think this change won't cause trouble for existing xsimd client code.



A few comments for discussion / investigation:

* Firstly, we will have to prepare ourselves to migrate to this new
API in the future

* At some point, we will likely want to generate SIMD-variants of our
C++ math kernels usable via dynamic dispatch for each different CPU
support level. It would be beneficial to author as much code in an
ISA-independent fashion that can be cross-compiled to generate binary
code for each ISA. We should investigate whether the new approach in
xsimd will provide what we need or if we need to take a different
approach.

* We have some of our own dynamic dispatch code to enable runtime
function pointer selection based on available SIMD levels. Can we
benefit from any of the work that is happening in this xsimd refactor?


I think they have some overlap. Runtime dispatch at the xsimd level (SIMD 
code block) looks better than at the kernel dispatch level, IIUC.
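
For comparison, a minimal sketch of kernel-level dispatch via a function
pointer resolved once at startup; SumScalar/SumAvx2 are illustrative names,
and __builtin_cpu_supports is a GCC/Clang builtin:

```
#include <cstddef>

float SumScalar(const float* v, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += v[i];
  return s;
}

// In a real build this would be a separate AVX2-compiled translation unit;
// the body here is a stand-in.
float SumAvx2(const float* v, std::size_t n) { return SumScalar(v, n); }

using SumFn = float (*)(const float*, std::size_t);

// Resolve once based on runtime CPU features.
SumFn ResolveSum() {
#if defined(__x86_64__)
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumScalar;
}

static const SumFn kSum = ResolveSum();  // hot path calls through the pointer
```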




* We have some compute code (e.g. hash tables for aggregation / joins)
that uses explicit AVX2 intrinsics — can some of this code be ported
to use generic xsimd APIs or will we need to use a different
fundamental algorithm design to yield maximum efficiency for each SIMD
ISA?

Thanks,
Wes



Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-09 Thread Yibo Cai
Congrats Weston!


From: Wes McKinney 
Sent: Friday, July 9, 2021 8:47 PM
To: dev 
Subject: [ANNOUNCE] New Arrow committer: Weston Pace

On behalf of the Arrow PMC, I'm happy to announce that Weston has accepted an
invitation to become a committer on Apache Arrow. Welcome, and thank you
for your contributions!

Wes


Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-24 Thread Yibo Cai




On 6/25/21 6:58 AM, Nate Bauernfeind wrote:

FYI, the bench was slightly broken; but the results stand.


benchmark::DoNotOptimize(output[rand()]);

Since rand() has a domain of 0 to MAX_INT it blows past the output array
(of length 4k). It segfaults in GCC; I'm not sure why the Clang benchmark
is happy with that.

I modified [1] it to:

benchmark::DoNotOptimize(output[rand() % N]);


The benchmarks run:
The Clang11 -O3 speed up is 2.3x.
The GCC10.2 -O3 speed up is 2.6x.

Interestingly, I added a second branching benchmark. The original branching
does this:
```
   if (bitmap[i/8] & (1 << (i%8))) {
 output[outpos++] = input[i];
   }


```


My additional branching benchmark pulls the assignment outside of the
branch:
```
   output[outpos] = input[i];
   if (bitmap[i/8] & (1 << (i%8))) {
 ++outpos;
   }
```


From the disassembler, the compiler optimizes this C code into the
branch-less assembly code below.

10.06%  sar    %cl,%edx
10.74%  and    $0x1,%edx
 9.93%  cmp    $0x1,%edx
        sbb    $0xffffffff,%esi

Basically, it clears/sets the borrow bit in the eflags register based on the if 
condition, and runs `outpos = outpos - (-1) - borrow_bit`.




The benchmarks run:
The GCC10.2 -O3 speed up compared to the original branching code is 2.3x.
(are you as surprised as me?)
The GCC10.2 -O3 speed up compared to the original non-branching code is
0.9x. (yes; it is slightly slower)

Point: reducing the footprint of a false branch prediction is a worthwhile
investment even when you can't simply get rid of the conditional.

Nate
[1] https://quick-bench.com/q/kDFoF2pOuvPo9aVFufkVJMjWf-g

On Thu, Jun 24, 2021 at 1:01 PM Niranda Perera 
wrote:


I created a JIRA for this. I will do the changes in select kernels and
report back with benchmark results
https://issues.apache.org/jira/browse/ARROW-13170


On Thu, Jun 24, 2021 at 12:27 AM Yibo Cai  wrote:


Did a quick test. For random bitmaps and my trivial test code, the
branch-less code is 3.5x faster than the branchy one.
https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro

On 6/23/21 11:21 PM, Wes McKinney wrote:

One project I was interested in getting to but haven't had the time
was introducing branch-free code into vector_selection.cc and reducing
the use of if-statements to try to improve performance.

One way to do this is to take code that looks like this:

if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
out_data_[out_position_++] = values_data_[in_position];
}
++in_position;

and change it to a branch-free version

bool advance = BitUtil::GetBit(filter_data_, filter_offset_ +

in_position);

BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast here
++in_position;

Since more people are working on kernels and computing now, I thought
this might be an interesting project for someone to explore and see
what improvements are possible (and what the differences between e.g.
x86 and ARM architecture are like when it comes to reducing
branching). Another thing to look at might be batch-at-a-time
bitpacking in the output bitmap versus bit-at-a-time.






--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>







Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-23 Thread Yibo Cai
Did a quick test. For random bitmaps and my trivial test code, the 
branch-less code is 3.5x faster than the branchy one.

https://quick-bench.com/q/UD22IIdMgKO9HU1PsPezj05Kkro
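
For reference, a self-contained version of the two variants compared in that
bench; note the branch-free loop stores unconditionally, so output must have
room for all n elements:

```
#include <cstddef>
#include <cstdint>

// Branchy: one (often mispredicted) branch per element on random bitmaps.
std::size_t FilterBranchy(const uint8_t* bitmap, const int32_t* input,
                          std::size_t n, int32_t* output) {
  std::size_t outpos = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (bitmap[i / 8] & (1 << (i % 8))) {
      output[outpos++] = input[i];
    }
  }
  return outpos;
}

// Branch-free: always store, advance the write position by the bit value.
std::size_t FilterBranchless(const uint8_t* bitmap, const int32_t* input,
                             std::size_t n, int32_t* output) {
  std::size_t outpos = 0;
  for (std::size_t i = 0; i < n; ++i) {
    output[outpos] = input[i];
    outpos += (bitmap[i / 8] >> (i % 8)) & 1;
  }
  return outpos;
}
```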

On 6/23/21 11:21 PM, Wes McKinney wrote:

One project I was interested in getting to but haven't had the time
was introducing branch-free code into vector_selection.cc and reducing
the use of if-statements to try to improve performance.

One way to do this is to take code that looks like this:

if (BitUtil::GetBit(filter_data_, filter_offset_ + in_position)) {
   BitUtil::SetBit(out_is_valid_, out_offset_ + out_position_);
   out_data_[out_position_++] = values_data_[in_position];
}
++in_position;

and change it to a branch-free version

bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position);
BitUtil::SetBitTo(out_is_valid_, out_offset_ + out_position_, advance);
out_data_[out_position_] = values_data_[in_position];
out_position_ += advance; // may need static_cast here
++in_position;

Since more people are working on kernels and computing now, I thought
this might be an interesting project for someone to explore and see
what improvements are possible (and what the differences between e.g.
x86 and ARM architecture are like when it comes to reducing
branching). Another thing to look at might be batch-at-a-time
bitpacking in the output bitmap versus bit-at-a-time.



Re: [ANNOUNCE] New Arrow PMC member: David M Li

2021-06-22 Thread Yibo Cai

Congrats David!

On 6/22/21 8:56 PM, David Li wrote:

Thanks everyone!

I've learned a lot and had a great time contributing here, and I look
forward to continuing to work with everybody.

Best,
David

On 2021/06/22 10:54:08, Krisztián Szűcs  wrote:

Congrats David!

On Tue, Jun 22, 2021 at 11:19 AM Rok Mihevc  wrote:


Congrats David!

On Tue, Jun 22, 2021 at 4:44 AM Micah Kornfield 
wrote:


Congrats!

On Mon, Jun 21, 2021 at 7:40 PM Weston Pace  wrote:


Congratulations David!

On Mon, Jun 21, 2021 at 2:24 PM Niranda Perera 

Congrats David! :-)

On Mon, Jun 21, 2021 at 6:32 PM Nate Bauernfeind <

nate.bauernfe...@gmail.com>

wrote:


Congratulations! Well earned!

On Mon, Jun 21, 2021 at 4:20 PM Ian Cook 

wrote:



Congratulations, David!

Ian


On Mon, Jun 21, 2021 at 6:19 PM Wes McKinney 

wrote:


The Project Management Committee (PMC) for Apache Arrow has

invited

David M Li to become a PMC member and we are pleased to announce
that David has accepted.

Congratulations and welcome!







--
Niranda Perera
https://niranda.dev/
@n1r44 








Re: Arrow Dataset API on Ceph

2021-06-21 Thread Yibo Cai

Hi Jayjeet,

I've successfully validated basic functions based on the links you 
provided, on both Arm64 and x86, with binaries built from your PR. 
Everything looks fine. From perf, I can see Arrow code running 
actively on the Ceph OSD nodes.


Currently, I have deployed and tested on 4 VMs. For performance evaluation, I 
will move to bare metal servers with NVMe drives. Do you have test cases 
for the benchmarking? Thanks.


Yibo

On 6/8/21 4:18 PM, Jayjeet Chakraborty wrote:

Hi Yibo,

Thanks a lot for your interest in our work. Please refer to this [1] guide to 
deploy a complete environment on a cluster of nodes. Regarding your comment 
about a Ceph patch, the arrow object class that we implement is actually a 
plugin and does not require the Ceph source tree for building or maintaining 
it. It only requires the rados-objclass-dev package as a dependency. We provide 
the CMake option ARROW_CLS  to allow optional build of the object class plugin. 
Please let us know if you have any questions or comments. We really appreciate 
you taking the time to look into our project. Thanks.

Best,
Jayjeet

[1] 
https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/adapters/arrow-rados-cls/docs/deploy.md

On 2021/06/07 10:36:08, Yibo Cai  wrote:

Hi Jayjeet,

It is exciting to see a real world computational storage solution built
upon Arrow and Ceph. Amazing work!

We are interested in this project (I'm from the Arm open source software
team, focusing on storage and big data OSS), and would like to reproduce
your work first, then evaluate performance on the Arm platform.

I went through your Arrow PR; it looks great. IIUC, there should be a
corresponding Ceph patch implementing the object class with Arrow.

I wonder what the best approach is to deploy a complete environment for a
quick evaluation. Any comments are welcome. Thanks.

Yibo

On 6/2/21 3:42 AM, Jayjeet Chakraborty wrote:

Dear Arrow Community,

In our previous discussion, we planned on implementing a new Dataset API
like InMemoryDataset to interact with objects containing IPC data stored in
Ceph/RADOS <https://ceph.io/>. We had implemented this design and raised a
PR <https://github.com/apache/arrow/pull/8647>. But when we started adding
the dataset discovery functionality, we found ourselves reimplementing
filesystem abstractions and its metadata management. We closed the original
PR and raised a new PR <https://github.com/apache/arrow/pull/10431> where
we redesigned our implementation to use the Ceph filesystem as our file I/O
interface since it provides fast metadata support via the Ceph metadata
servers (MDS). We also decided to store data using one of the file formats
supported by Arrow. One of our driving use cases favored Parquet.

Since we perform the scan operation inside the storage layer using Ceph
Object class
<https://docs.ceph.com/en/latest/rados/api/objclass-sdk/#:~:text=Ceph%20can%20be%20extended%20by,object%20classes%20within%20the%20tree.>
methods which need to be invoked directly on objects, we utilize the
striping strategy information provided by CephFS to translate filename in
CephFS to  object id in RADOS. To be able to have this one-to-one mapping,
we split Parquet files in a manner similar to how Spark splits Parquet
files for HDFS and ensure that each fragment is backed by a single RADOS
object.

We are planning a new PR, we extend the FileFormat interface to create a
RadosParquetFileFormat
<https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/dataset/file_rados_parquet.h#L129>
interface that offloads Parquet file scan operations to the RADOS layer in
Ceph. Since we now utilize a filesystem interface, we can just use the
FileSystemDataset API and plug in our new format to offload scan
operations. We have also added Python bindings for the new APIs that we
implemented. In all, our patch only consists of around 3,000 LoC and
introduces new dependencies to Ceph’s librados and object class SDK only
(that can be disabled via cmake flags).

We have added an architecture
<https://github.com/uccross/skyhookdm-arrow/blob/rados-parquet-pr/cpp/src/arrow/adapters/arrow-rados-cls/docs/architecture.md>
document with our PR which describes the overall architecture along with
the life of a dataset scan on using RadosParquet. Additionally, we recently
wrote up a paper <https://arxiv.org/abs/2105.09894> describing our design
and implementation along with some initial benchmarks given there. We plan
to raise a PR <https://github.com/apache/arrow/pull/10431> to upstream our
format to apache/arrow soon and hence look forward to your comments and
thoughts on this new feature. Please let us know if you have any questions.
Thank you.

Best regards,

Jayjeet Chakraborty

On 2020/09/15 18:06:56, Micah Kornfield  wrote:

gmock is already a dependency.  We haven't upgraded gmock/gtest in a

while,

we might want to consider doing that (but this is orthogonal).

On Tue, Sep 15, 202

Re: Arrow Dataset API on Ceph

2021-06-07 Thread Yibo Cai

Hi Jayjeet,

It is exciting to see a real world computational storage solution built
upon Arrow and Ceph. Amazing work!

We are interested in this project (I'm from the Arm open source software
team, focusing on storage and big data OSS), and would like to reproduce
your work first, then evaluate performance on the Arm platform.

I went through your Arrow PR; it looks great. IIUC, there should be a
corresponding Ceph patch implementing the object class with Arrow.

I wonder what the best approach is to deploy a complete environment for a
quick evaluation. Any comments are welcome. Thanks.

Yibo

On 6/2/21 3:42 AM, Jayjeet Chakraborty wrote:

Dear Arrow Community,

In our previous discussion, we planned on implementing a new Dataset API
like InMemoryDataset to interact with objects containing IPC data stored in
Ceph/RADOS <https://ceph.io/>. We had implemented this design and raised a
PR <https://github.com/apache/arrow/pull/8647>. But when we started adding
the dataset discovery functionality, we found ourselves reimplementing
filesystem abstractions and its metadata management. We closed the original
PR and raised a new PR <https://github.com/apache/arrow/pull/10431> where
we redesigned our implementation to use the Ceph filesystem as our file I/O
interface since it provides fast metadata support via the Ceph metadata
servers (MDS). We also decided to store data using one of the file formats
supported by Arrow. One of our driving use cases favored Parquet.

Since we perform the scan operation inside the storage layer using Ceph
Object class
<https://docs.ceph.com/en/latest/rados/api/objclass-sdk/#:~:text=Ceph%20can%20be%20extended%20by,object%20classes%20within%20the%20tree.>
methods which need to be invoked directly on objects, we utilize the
striping strategy information provided by CephFS to translate filename in
CephFS to  object id in RADOS. To be able to have this one-to-one mapping,
we split Parquet files in a manner similar to how Spark splits Parquet
files for HDFS and ensure that each fragment is backed by a single RADOS
object.

We are planning a new PR, we extend the FileFormat interface to create a
RadosParquetFileFormat
<https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/dataset/file_rados_parquet.h#L129>
interface that offloads Parquet file scan operations to the RADOS layer in
Ceph. Since we now utilize a filesystem interface, we can just use the
FileSystemDataset API and plug in our new format to offload scan
operations. We have also added Python bindings for the new APIs that we
implemented. In all, our patch only consists of around 3,000 LoC and
introduces new dependencies to Ceph’s librados and object class SDK only
(that can be disabled via cmake flags).

We have added an architecture
<https://github.com/uccross/skyhookdm-arrow/blob/rados-parquet-pr/cpp/src/arrow/adapters/arrow-rados-cls/docs/architecture.md>
document with our PR which describes the overall architecture along with
the life of a dataset scan on using RadosParquet. Additionally, we recently
wrote up a paper <https://arxiv.org/abs/2105.09894> describing our design
and implementation along with some initial benchmarks given there. We plan
to raise a PR <https://github.com/apache/arrow/pull/10431> to upstream our
format to apache/arrow soon and hence look forward to your comments and
thoughts on this new feature. Please let us know if you have any questions.
Thank you.

Best regards,

Jayjeet Chakraborty

On 2020/09/15 18:06:56, Micah Kornfield  wrote:

gmock is already a dependency.  We haven't upgraded gmock/gtest in a

while,

we might want to consider doing that (but this is orthogonal).

On Tue, Sep 15, 2020 at 10:16 AM Antoine Pitrou 

wrote:




Hi Ivo,

You can open a JIRA once you've got a PR ready.  No need to do it before
you think you're ready for submission.

AFAIK, gmock is already a dependency.

Regards

Antoine.



Le 15/09/2020 à 18:49, Ivo Jimenez a écrit :

Hi again,

We noticed in the contribution guidelines that there needs to be an

issue for every PR in JIRA. Should we open one for the eventual PR for

the

work we're doing on implementing the dataset on Ceph's RADOS?


Also, on a related note, we would like to mock the RADOS client so

that

we can integrate it in CI tests. Would it be OK to include gmock as a
dependency?


thanks!

On 2020/09/02 22:05:51, Ivo Jimenez  wrote:

Hi Ben,



Our main concern is that this new arrow::dataset::RadosFormat class

will

be

deriving from the arrow::dataset::FileFormat class, which seems to

raise

a

conceptual mismatch as there isn’t really a RADOS format


IIUC RADOS doesn't interact with a filesystem directly, so

RadosFileFormat

would
indeed be a conceptually problematic point of extension. If a RADOS

file

system
is not viable then I think the ideal approach would be to directly
implement the
Fragment [1] and Dataset [2] interfaces, forgoing a FileFormat
implementation altogether.
Unfortunately the only example we have of this approach is

Re: C++ RecordBatch Debugging Segmentation Fault

2021-05-20 Thread Yibo Cai

Great analysis Weston!

It looks like SimpleRecordBatch::column() is not thread safe for gcc < 5.0, as we are
simulating shared_ptr atomic load/store with plain load/store.
https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.cc#L80-L87
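
For illustration, a minimal sketch (hypothetical names) of the difference:
the C++11 atomic free functions make concurrent load/store of a shared_ptr
slot safe, while a plain load/store fallback races on the control block,
matching the crash in _M_release seen below.

#include <atomic>
#include <memory>

std::shared_ptr<int> slot;  // stands in for a lazily boxed column

void Writer() {
  // Atomically publishes a new value; the old control block is released safely.
  std::atomic_store(&slot, std::make_shared<int>(42));
}

std::shared_ptr<int> Reader() {
  // Atomically bumps the refcount while loading.
  return std::atomic_load(&slot);
}

// With gcc < 5.0 the fallback is effectively `slot = ...;` / `return slot;`,
// a data race: the reader may copy a control-block pointer that the writer
// is simultaneously releasing.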

On 5/21/21 8:15 AM, Weston Pace wrote:

I like Yibo's stack overflow theory given the "error reading variable"
but I did confirm that I can cause a segmentation fault if
std::atomic_store /  std::atomic_load are unavailable.  I simulated
this by simply commenting out the specializations rather than actually
run against GCC 4.9.2 so it may not be perfect.  I've attached a patch
with my stress test (based on the latest master,
#c697a41ab9c1153e7387fe4710df920c36ed).  Running that stress test
while running `stress -c 16` on my server reproduces it pretty
reliably.

Thread 1 (Thread 0x7f6ae05fc700 (LWP 2308757)):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f6ae352e859 in __GI_abort () at abort.c:79
#2  0x7f6ae37fe892 in __gnu_cxx::__verbose_terminate_handler () at
/home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x7f6ae37fcf69 in __cxxabiv1::__terminate (handler=) at 
/home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x7f6ae37fcfab in std::terminate () at
/home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x7f6ae37fd9d0 in __cxxabiv1::__cxa_pure_virtual () at
/home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/pure.cc:50
#6  0x55a64bc4400a in
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
(this=0x7f6ad0001160) at
/home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:155
#7  0x55a64bc420f3 in
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count
(this=0x7f6ae05fa568, __in_chrg=) at
/home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:730
#8  0x55a64bc3a4a2 in std::__shared_ptr::~__shared_ptr (this=0x7f6ae05fa560,
__in_chrg=) at
/home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1169
#9  0x55a64bc3a4be in std::shared_ptr::~shared_ptr
(this=0x7f6ae05fa560, __in_chrg=) at
/home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr.h:103
#10 0x55a64bc557ca in
arrow::TestRecordBatch_BatchColumnBoxingStress_Testoperator()(void)
const (__closure=0x55a64d5f5218) at
../src/arrow/record_batch_test.cc:206

As a workaround to see if this is indeed your issue, you can call
RecordBatch::column on each of the columns as soon as you create the
RecordBatch (from one thread) which will force the boxed columns to
materialize.

-Weston
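
In code, the suggested workaround might look like this sketch (a hypothetical
helper, assuming RecordBatch::column() caches the boxed array once created):

#include <memory>
#include "arrow/record_batch.h"

// Touch every column once, from a single thread, right after the batch is
// created, so the lazily boxed columns are materialized before any
// concurrent access.
void MaterializeColumns(const std::shared_ptr<arrow::RecordBatch>& batch) {
  for (int i = 0; i < batch->num_columns(); ++i) {
    batch->column(i);  // forces creation and caching of the boxed column
  }
}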

On Thu, May 20, 2021 at 11:40 AM Wes McKinney  wrote:


Also, is it possible that the field is not an Int64Array?

On Wed, May 19, 2021 at 10:19 PM Yibo Cai  wrote:


On 5/20/21 4:15 AM, Rares Vernica wrote:

Hello,

I'm using Arrow for accessing data outside the SciDB database engine. It
generally works fine but we are running into Segmentation Faults in a
corner multi-threaded case. I identified two threads that work on the same
Record Batch. I wonder if there is something internal about RecordBatch
that might help solve the mystery.

We are using Arrow 0.16.0. The backtrace of the triggering thread looks
like this:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fdad5fb4700 (LWP 3748)]
0x7fdaa805abe0 in ?? ()
(gdb) thread
[Current thread is 2 (Thread 0x7fdad5fb4700 (LWP 3748))]
(gdb) bt
#0  0x7fdaa805abe0 in ?? ()
#1  0x00850212 in
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#2  0x7fdae4b1fbf1 in
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count
(this=0x7fdad5fb1ae8, __in_chrg=) at
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:666
#3  0x7fdae4b39d74 in std::__shared_ptr::~__shared_ptr (this=0x7fdad5fb1ae0,
__in_chrg=) at
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:914
#4  0x7fdae4b39da8 in std::shared_ptr::~shared_ptr
(this=0x7fdad5fb1ae0, __in_chrg=) at
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:93
#5  0x7fdae4b6a8e1 in scidb::XChunkIterator::getCoord
(this=0x7fdaa807f9f0, dim=1, index=1137) at XArray.cpp:358
#6  0x7fdae4b68ecb in scidb::XChunkIterator::XChunkIterator
(this=0x7fdaa807f9f0, chunk=..., iterationMode=0, arrowBatch=) at XArray.cpp:157


FWIW, this "error reading variable" looks suspicious. Maybe the argument
'arrowBatch' is trashed accidentally (stack overflow)?
https://github.com/Paradigm4/bridge/blob/master/src/XArray.cpp#L132

Re: C++ RecordBatch Debugging Segmentation Fault

2021-05-19 Thread Yibo Cai

On 5/20/21 4:15 AM, Rares Vernica wrote:

Hello,

I'm using Arrow for accessing data outside the SciDB database engine. It
generally works fine but we are running into Segmentation Faults in a
corner multi-threaded case. I identified two threads that work on the same
Record Batch. I wonder if there is something internal about RecordBatch
that might help solve the mystery.

We are using Arrow 0.16.0. The backtrace of the triggering thread looks
like this:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fdad5fb4700 (LWP 3748)]
0x7fdaa805abe0 in ?? ()
(gdb) thread
[Current thread is 2 (Thread 0x7fdad5fb4700 (LWP 3748))]
(gdb) bt
#0  0x7fdaa805abe0 in ?? ()
#1  0x00850212 in
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#2  0x7fdae4b1fbf1 in
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count
(this=0x7fdad5fb1ae8, __in_chrg=) at
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:666
#3  0x7fdae4b39d74 in std::__shared_ptr::~__shared_ptr (this=0x7fdad5fb1ae0,
__in_chrg=) at
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:914
#4  0x7fdae4b39da8 in std::shared_ptr::~shared_ptr
(this=0x7fdad5fb1ae0, __in_chrg=) at
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:93
#5  0x7fdae4b6a8e1 in scidb::XChunkIterator::getCoord
(this=0x7fdaa807f9f0, dim=1, index=1137) at XArray.cpp:358
#6  0x7fdae4b68ecb in scidb::XChunkIterator::XChunkIterator
(this=0x7fdaa807f9f0, chunk=..., iterationMode=0, arrowBatch=) at XArray.cpp:157


FWIW, this "error reading variable" looks suspicious. Maybe the argument
'arrowBatch' is trashed accidentally (stack overflow)?
https://github.com/Paradigm4/bridge/blob/master/src/XArray.cpp#L132


...

The backtrace of the other thread working on exactly the same Record Batch
looks like this:

(gdb) thread
[Current thread is 3 (Thread 0x7fdad61b5700 (LWP 3746))]
(gdb) bt
#0  0x7fdae3bc1ec7 in arrow::SimpleRecordBatch::column(int) const ()
from /lib64/libarrow.so.16
#1  0x7fdae4b6a888 in scidb::XChunkIterator::getCoord
(this=0x7fdab00c0bb0, dim=0, index=71) at XArray.cpp:357
#2  0x7fdae4b6a5a2 in scidb::XChunkIterator::operator++
(this=0x7fdab00c0bb0) at XArray.cpp:305
...

In both cases, the last non-Arrow code is the getCoord function
https://github.com/Paradigm4/bridge/blob/master/src/XArray.cpp#L355

    int64_t XChunkIterator::getCoord(size_t dim, int64_t index)
    {
        return std::static_pointer_cast<arrow::Int64Array>(
            _arrowBatch->column(_nAtts + dim))->raw_values()[index];
    }
...
std::shared_ptr<arrow::RecordBatch> _arrowBatch;

Do you see anything suspicious about this code? What would trigger the
shared_ptr destruction which takes place in thread 2?

Thank you!
Rares



RE: [C++] Indeterminate poor performance of random number generator

2021-04-22 Thread Yibo Cai
Yes, this soft-float math (in libm.so) makes the Arm binary extremely slow.

-Original Message-
From: Antoine Pitrou 
Sent: Thursday, April 22, 2021 17:20
To: dev@arrow.apache.org
Subject: Re: [C++] Indeterminate poor performance of random number generator


Le 22/04/2021 à 03:38, Yibo Cai a écrit :
>
> Both using same libstdc++.
> But std::bernoulli_distribution is inlined, so they are indeed different for 
> clang and gcc.
> https://godbolt.org/z/aT84x5Yec
It looks like a pure compiler thing.

It looks like clang generates calls to logl() and __divtf3() (soft-float long 
double division) inside the loop.  Perhaps that can be avoided by 
reimplementing the Bernoulli distribution.  If we don't care too much about 
accuracy and extreme probability values (very close to 0 or 1), that should be 
relatively easy.

Regards

Antoine.
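
A minimal sketch of what such a reimplementation could look like (an
illustration, not Arrow code): precompute an integer threshold from the
probability so that each draw is one integer compare, with no long-double
math in the loop. Accuracy at probabilities very close to 0 or 1 is traded
away, as discussed above.

#include <cstdint>
#include <random>

class FastBernoulli {
 public:
  explicit FastBernoulli(double p)
      : threshold_(static_cast<uint64_t>(p * 4294967296.0)) {}  // p * 2^32

  template <typename Rng>
  bool operator()(Rng& rng) {
    // Assumes rng yields at least 32 uniform random bits per call.
    return (static_cast<uint64_t>(rng()) & 0xffffffffULL) < threshold_;
  }

 private:
  uint64_t threshold_;
};

// Usage sketch: std::mt19937 rng(42); FastBernoulli d(0.25); bool b = d(rng);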


Re: [C++] Indeterminate poor performance of random number generator

2021-04-22 Thread Yibo Cai

On 4/22/21 9:38 AM, Yibo Cai wrote:

On 4/21/21 6:07 PM, Antoine Pitrou wrote:



Le 21/04/2021 à 11:41, Yibo Cai a écrit :

On 4/21/21 5:17 PM, Antoine Pitrou wrote:


Le 21/04/2021 à 11:14, Yibo Cai a écrit :

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" takes 90s to finish.
I find almost all the time is spent generating random bits (preparing the test 
data)[1], not running the test itself.

The sample code below shows the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finishes in 0.1s; for clang, it finishes in 11s, which is 
very bad.
This issue does not happen on Apple M1 with the Apple clang-12 arm64 compiler.
On x86, the clang-built random engine is also much slower than the gcc-built 
one, but the gap is much smaller.

As std::default_random_engine is implementation defined[2], I think its 
performance (randomness, speed) is indeterminate.
Maybe there are better ways to generate random bits?


Can you try out https://github.com/apache/arrow/pull/8879 ?



I tested this PR on Arm64. It improves speed a lot, but the clang-built code 
is still quite slow.
For the benchmark "ModeKernelNarrow/1048576/1", run times are as below:
- clang: master branch - 90s,   with PR-8879 - 55s
- gcc:   master branch - 2.5s,  with PR-8879 - 1.5s


Is it using GNU libstdc++ or clang's libc++?
If the latter, perhaps the Bernoulli implementation is very bad?



Both using same libstdc++.
But std::bernoulli_distribution is inlined, so they are indeed different for 
clang and gcc.
https://godbolt.org/z/aT84x5Yec
It looks like a pure compiler thing.


For the record: clang's "-ffast-math" option makes the Bernoulli generator 100x 
faster on Arm64 and 10x faster on x86_64.
As this is only for generating test bits, "-ffast-math" looks safe for the 
purpose.
https://clang.llvm.org/docs/UsersManual.html#cmdoption-ffast-math


Re: [VOTE] Release Apache Arrow 4.0.0 - RC3

2021-04-21 Thread Yibo Cai

+1

Verified C++ and Python on Arm64 Linux (Ubuntu-18.04).
TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CPP=1 TEST_PYTHON=1 
dev/release/verify-release-candidate.sh source 4.0.0 3

On 4/22/21 5:30 AM, Krisztián Szűcs wrote:

Hi,

I would like to propose the following release candidate (RC3) of Apache
Arrow version 4.0.0. This is a release consisting of 719
resolved JIRA issues[1].

This release candidate is based on commit:
f959141ece4d660bce5f7fa545befc0116a7db79 [2]

The source release rc3 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7].
The changelog is located at [8].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [9] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 4.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 4.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%204.0.0
[2]: 
https://github.com/apache/arrow/tree/f959141ece4d660bce5f7fa545befc0116a7db79
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-4.0.0-rc3
[4]: https://bintray.com/apache/arrow/centos-rc/4.0.0-rc3
[5]: https://bintray.com/apache/arrow/debian-rc/4.0.0-rc3
[6]: https://bintray.com/apache/arrow/python-rc/4.0.0-rc3
[7]: https://bintray.com/apache/arrow/ubuntu-rc/4.0.0-rc3
[8]: 
https://github.com/apache/arrow/blob/f959141ece4d660bce5f7fa545befc0116a7db79/CHANGELOG.md
[9]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates



Re: [C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Yibo Cai

On 4/21/21 6:07 PM, Antoine Pitrou wrote:



Le 21/04/2021 à 11:41, Yibo Cai a écrit :

On 4/21/21 5:17 PM, Antoine Pitrou wrote:


Le 21/04/2021 à 11:14, Yibo Cai a écrit :

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" takes 90s to finish.
I find almost all the time is spent generating random bits (preparing the test 
data)[1], not running the test itself.

The sample code below shows the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finishes in 0.1s; for clang, it finishes in 11s, which is 
very bad.
This issue does not happen on Apple M1 with the Apple clang-12 arm64 compiler.
On x86, the clang-built random engine is also much slower than the gcc-built 
one, but the gap is much smaller.

As std::default_random_engine is implementation defined[2], I think its 
performance (randomness, speed) is indeterminate.
Maybe there are better ways to generate random bits?


Can you try out https://github.com/apache/arrow/pull/8879 ?



I tested this PR on Arm64. It improves speed a lot, but the clang-built code 
is still quite slow.
For the benchmark "ModeKernelNarrow/1048576/1", run times are as below:
- clang: master branch - 90s,   with PR-8879 - 55s
- gcc:   master branch - 2.5s,  with PR-8879 - 1.5s


Is it using GNU libstdc++ or clang's libc++?
If the latter, perhaps the Bernoulli implementation is very bad?



Both using same libstdc++.
But std::bernoulli_distribution is inlined, so they are indeed different for 
clang and gcc.
https://godbolt.org/z/aT84x5Yec
It looks like a pure compiler thing.


Re: [C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Yibo Cai

On 4/21/21 5:17 PM, Antoine Pitrou wrote:


Le 21/04/2021 à 11:14, Yibo Cai a écrit :

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" takes 90s to finish.
I find almost all the time is spent generating random bits (preparing the test 
data)[1], not running the test itself.

The sample code below shows the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finishes in 0.1s; for clang, it finishes in 11s, which is 
very bad.
This issue does not happen on Apple M1 with the Apple clang-12 arm64 compiler.
On x86, the clang-built random engine is also much slower than the gcc-built 
one, but the gap is much smaller.

As std::default_random_engine is implementation defined[2], I think its 
performance (randomness, speed) is indeterminate.
Maybe there are better ways to generate random bits?


Can you try out https://github.com/apache/arrow/pull/8879 ?



I tested this PR on Arm64. It improves speed a lot, but the clang-built code 
is still quite slow.
For the benchmark "ModeKernelNarrow/1048576/1", run times are as below:
- clang: master branch - 90s,   with PR-8879 - 55s
- gcc:   master branch - 2.5s,  with PR-8879 - 1.5s


[C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Yibo Cai

When running benchmarks on Arm64 servers, I find some benchmarks are extremely 
slow when built with clang.
E.g., "ModeKernelNarrow/1048576/1" takes 90s to finish.
I find almost all the time is spent generating random bits (preparing the test 
data)[1], not running the test itself.

The sample code below shows the issue. Tested on Arm64 with clang-10 and 
gcc-7.5, built with -O3.
For gcc, the code finishes in 0.1s; for clang, it finishes in 11s, which is 
very bad.
This issue does not happen on Apple M1 with the Apple clang-12 arm64 compiler.
On x86, the clang-built random engine is also much slower than the gcc-built 
one, but the gap is much smaller.

As std::default_random_engine is implementation defined[2], I think its 
performance (randomness, speed) is indeterminate.
Maybe there are better ways to generate random bits?

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/testing/random.cc#L101-L112
[2] https://en.cppreference.com/w/cpp/numeric/random

#include <random>
int main() {
  std::default_random_engine rng(42);
  std::bernoulli_distribution d(0.25);

  int s = 0;
  for (int i = 0; i < 8 * 1024 * 1024; ++i) {
s += d(rng);
  }

  return s;
}


Re: [VOTE] Release Apache Arrow 4.0.0 - RC1

2021-04-20 Thread Yibo Cai

'gandiva-decimal-test' hangs on my machine, not sure if it's a blocker issue.
Details at https://issues.apache.org/jira/browse/ARROW-12476
Test command "TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CPP=1 
dev/release/verify-release-candidate.sh source 4.0.0 1"

On 4/19/21 10:50 PM, Krisztián Szűcs wrote:

Hi,

I would like to propose the following release candidate (RC1) of Apache
Arrow version 4.0.0. This is a release consisting of 703
resolved JIRA issues[1].

This release candidate is based on commit:
9f0082d27366f2d1985d0b5abbef7f2f07fd7e7e [2]

The source release rc1 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7].
The changelog is located at [8].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [9] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 4.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 4.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%204.0.0
[2]: 
https://github.com/apache/arrow/tree/9f0082d27366f2d1985d0b5abbef7f2f07fd7e7e
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-4.0.0-rc1
[4]: https://bintray.com/apache/arrow/centos-rc/4.0.0-rc1
[5]: https://bintray.com/apache/arrow/debian-rc/4.0.0-rc1
[6]: https://bintray.com/apache/arrow/python-rc/4.0.0-rc1
[7]: https://bintray.com/apache/arrow/ubuntu-rc/4.0.0-rc1
[8]: 
https://github.com/apache/arrow/blob/9f0082d27366f2d1985d0b5abbef7f2f07fd7e7e/CHANGELOG.md
[9]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates



Re: [DISCUSS][C++] Reduce usage of KernelContext in compute::

2021-03-11 Thread Yibo Cai

Besides reporting errors, maybe a kernel wants to allocate memory through 
KernelContext::memory_pool [1] in Kernel::init?
I'm not quite sure whether this is a valid use case; I would like to hear other comments.

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L95

Yibo

On 3/12/21 5:24 AM, Benjamin Kietzman wrote:

KernelContext is a tuple consisting of pointers to an ExecContext and a
KernelState, and an error Status. The context's error Status may be set by
compute kernels (for example when divide-by-zero would occur) rather than
returning a Result as in the rest of the codebase. IIUC the intent is to
avoid branching on always-ok Statuses for kernels which don't have an error
condition (for example addition without overflow checks).

If there's a motivating performance reason for non-standard error
propagation then we should continue using KernelContext wherever we can
benefit from it. However, several other APIs (such as Kernel::init) also
use a KernelContext to report errors. IMHO, it would be better to avoid
the added cognitive overhead of handling errors through KernelContext
outside hot loops which benefit from it.

Am I missing anything? Is there any reason (for example) Kernel::init
shouldn't just return a Result<std::unique_ptr<KernelState>>?

Ben Kietzman
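
To make the question concrete, a rough sketch of the two styles (the
signatures are approximations, not the actual Arrow declarations):

#include <memory>
#include "arrow/result.h"

struct KernelContext;   // approximate stand-ins for the real Arrow types
struct KernelInitArgs;
struct KernelState;

// Status quo (approximate): errors are reported through the context, which
// the caller must remember to check afterwards.
std::unique_ptr<KernelState> Init(KernelContext* ctx, const KernelInitArgs& args);

// Proposed (approximate): errors propagate through Result, as in the rest
// of the codebase.
arrow::Result<std::unique_ptr<KernelState>> InitWithResult(const KernelInitArgs& args);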



Re: New committer: Yibo Cai

2021-03-07 Thread Yibo Cai

Great honor for me. Thanks all!

Yibo

On 3/6/21 9:20 AM, Neal Richardson wrote:

Congrats Yibo!

Neal

On Fri, Mar 5, 2021 at 4:46 PM Nishit Kumar  wrote:


Congratulations Yibo! Wishing you all the best, and looking forward to
your contributions in making Arrow a robust tool.

Cheers!
Nishit

Sent from my iPhone


On 06-Mar-2021, at 12:48 AM, Antoine Pitrou  wrote:


Hello,

The Project Management Committee (PMC) for Apache Arrow has invited Yibo

Cai to become a committer and we are pleased to announce that he has
accepted.  Yibo is a frequent contributor to the C++ Arrow implementation.


Being a committer enables easier contribution to the project since there

is no need to go via the patch submission process. This should enable
better productivity.


Best regards

Antoine.






RE: [Rust] Contributing to Apache Arrow

2021-03-03 Thread Yibo Cai
Hi Ivan,

I guess you didn't log in to Jira? Otherwise you would see the "Assign to me" link in 
the right pane.
You can click "Log In" at the upper right corner, and "Sign up" for an account if 
you don't have one.

Yibo


-Original Message-
From: Ivan Vankov 
Sent: Wednesday, March 3, 2021 16:41
To: dev@arrow.apache.org
Subject: [Rust] Contributing to Apache Arrow

Hello,
I decided to try contributing to Apache Arrow. Since I'm completely new to this 
project I've chosen a beginner-friendly task, ARROW-10903, but I cannot assign 
it to myself. So, could someone please help with that?


Re: Requirements on JIRA usage in Apache Arrow

2021-03-02 Thread Yibo Cai
I prefer keeping Jira, simply because I'm familiar with it and use it in daily 
work.
I log detailed progress, findings, and todos for non-trivial tasks in 
Jira comments. It does help me.

Yibo


From: Sutou Kouhei 
Sent: Tuesday, March 2, 2021 9:47 AM
To: dev@arrow.apache.org 
Subject: Re: Requirements on JIRA usage in Apache Arrow

Hi,

Can we discuss whether we change the platform of a single point
of truth for developer activity to GitHub from JIRA? Can we
follow "The Apache Way" with GitHub? What pros/cons do we
have by changing the platform to GitHub from JIRA? Should we
keep using JIRA for the platform?

Any thoughts?

Thanks,
--
kou

In 
  "Re: Requirements on JIRA usage in Apache Arrow" on Sat, 27 Feb 2021 15:48:04 
-0500,
  Andrew Lamb  wrote:

> Here is a proposed improvement to merge_pr.py that will offer to create a
> JIRA issue from a github PR if one does not exist:
>
> https://github.com/apache/arrow/pull/9598
>
> On Wed, Feb 17, 2021 at 4:50 PM Wes McKinney  wrote:
>
>> Read more (this is one ASF member's interpretation of the Openness
>> tenet of the Apache Way) about this:
>>
>> http://theapacheway.com/open/
>>
>> On Wed, Feb 17, 2021 at 3:46 PM Wes McKinney  wrote:
>> >
>> > For trivial PRs that do not merit mention in the changelog you could
>> > preface the issue title with something like "ARROW-XXX" and we can
>> > modify the merge tool to bypass the consistency check for these. I
>> > think some other Apache projects do this. I can understand how it
>> > might seem like a nuisance to get a Jira when fixing a typo in a
>> > README, so this is easy to fix.
>> >
>> > For contributors doing non-trivial work, I think we want to try to get
>> > people in the habit of putting out there what they are working on.
>> > That's the thing that's most consistent with "The Apache Way" ― write
>> > things down, make plans in the open, allow others to see what is going
>> > on and not have the roadmap existing exclusively in people's minds.
>> >
>> > On Wed, Feb 17, 2021 at 3:41 PM Andrew Lamb 
>> wrote:
>> > >
>> > > Thanks for the background Wes. This is exactly what I was looking for.
>> > >
>> > > I think using JIRA for the single source of truth / project management
>> has
>> > > lots of value and I don't want to propose changing that. I am trying to
>> > > lower the barrier to contributing to Arrow even more.
>> > >
>> > > While I agree creating JIRA tickets is not hard, it is simply a few
>> more
>> > > steps for every PR and every contributor. The overhead is that much
>> more if
>> > > you don't already have a JIRA account -- if I can avoid just a few more
>> > > steps and get a few more contributors I will consider it a win.
>> > >
>> > > Given this info, I will do some research into the technical options,
>> and
>> > > make a more concrete proposal / prototype for automation in a while.
>> > >
>> > > Thanks again,
>> > > Andrew
>> > >
>> > > On Wed, Feb 17, 2021 at 1:28 PM Wes McKinney 
>> wrote:
>> > >
>> > > > hi Andrew,
>> > > >
>> > > > There isn't a hard requirement. It's a culture thing where the
>> purpose
>> > > > of Jira issues is to create a changelog and for developers to
>> > > > communicate publicly what work they are proposing to perform in the
>> > > > project. We decided by consensus (essentially) that having a single
>> > > > point of truth for developer activity in the project was a good idea.
>> > > >
>> > > > On Wed, Feb 17, 2021 at 12:09 PM Andrew Lamb 
>> wrote:
>> > > > >
>> > > > > Can someone tell me / point me at what the actual "requirements"
>> for
>> > > > using
>> > > > > JIRA in Apache Arrow are?
>> > > > >
>> > > > > Specifically, I would like to know:
>> > > > >
>> > > > > 1. Where does the requirement for each commit to have a JIRA
>> ticket come
>> > > > > from? (Is that Apache Arrow specific, or is it a more general
>> Apache
>> > > > > governance requirement? Something else?)
>> > > > >
>> > > > > 2. Does each commit need to be associated with a specific JIRA user
>> > > > > account, or is a github username sufficient?
>> > > >
>> > > > We would prefer that issues be assigned to a Jira user. If you want
>> to
>> > > > create an issue on behalf of an uncooperative person and assign it to
>> > > > yourself, you can do that, too.
>> > > >
>> > > > > Background:  I am following up on an item raised at the Arrow Sync
>> call
>> > > > > today and trying to determine how much of the current required
>> Arrow JIRA
>> > > > > process could be automated. Micah mentioned that the JIRA
>> specifics might
>> > > > > be related to ASF governance process or requirements, and I am
>> trying to
>> > > > > research what those are.
>> > > >
>> > > > We could easily automate the creation of a Jira issue using a bot of
>> > > > some kind. I don't think that creating issue is a hardship, though
>> > > > (having created thousands of them myself over the last 5 years). My
>> > > > position is that the hardship exists in the mind of the 

[C++] adopting an SIMD library

2021-02-08 Thread Yibo Cai

This topic was discussed in an earlier thread [1], but nothing has landed yet.

PR https://github.com/apache/arrow/pull/9424 optimizes ByteStreamSplit with 
Arm64 NEON; maybe it's a good chance to evaluate the possibility of simplifying 
arch-dependent SIMD code with a SIMD library.

I did a quick comparison of four open source SIMD libraries we've mentioned in 
earlier talks.

All libraries support C++11, GCC/Clang/MSVC, x86 (up to AVX512), Arm64 NEON, 
with permissive open source licenses.

Some differences:

- nsimd: https://github.com/agenium-scale/nsimd
  * supports Arm64 SVE, Cuda
  * needs installation, not header only
  * 133 stars, 11 contributors

- mipp: https://github.com/aff3ct/MIPP
  * header only
  * 241 stars, 2 contributors

- xsimd: https://github.com/xtensor-stack/xsimd
  * header only
  * 938 stars, 28 contributors

- libsimdpp: https://github.com/p12tic/libsimdpp
  * supports PPC, MIPS
  * header only
  * 906 stars, 17 contributors

I have a little experience with libsimdpp. It's straightforward to use, and I suppose 
the other libraries are similar.
I would prefer xsimd, simply because it has more stars and contributors, and a 
more active community.
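
For a taste of the wrapper style, a minimal sketch using xsimd (illustrative
only; exact API names vary across xsimd versions): the same source compiles
to SSE/AVX on x86 and NEON on Arm64.

#include <cstddef>
#include <xsimd/xsimd.hpp>

// Element-wise addition written once, vectorized for the build target.
void Add(const float* a, const float* b, float* out, std::size_t n) {
  using batch = xsimd::batch<float>;  // width chosen by the target arch
  std::size_t i = 0;
  for (; i + batch::size <= n; i += batch::size) {
    auto va = batch::load_unaligned(a + i);
    auto vb = batch::load_unaligned(b + i);
    (va + vb).store_unaligned(out + i);
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}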

[1] 
https://mail-archives.apache.org/mod_mbox/arrow-dev/202006.mbox/%3C3667345c-fdd2-5bbd-9bff-023282c377d8%40python.org%3E


pass input args directly to kernel

2020-12-14 Thread Yibo Cai

The current kernel framework divides inputs (e.g. arrays, chunked arrays) into 
batches and feeds them to kernel code.
Does it make sense to pass input args directly to the kernel?
I'm writing a quantile kernel, which needs to allocate a buffer to record all 
inputs and find the nth element at the end. For a chunked array, input is 
received chunk by chunk, so the kernel doesn't know the total buffer size to 
allocate up front. It would be convenient if the kernel could see the raw 
chunked array input.
Or are there better ways to achieve this? Thanks.
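
For context, a sketch of why this is awkward under the current interface
(hypothetical state, not the actual kernel code): "consume" can only append
each chunk to a growing buffer, and the real work happens at "finalize".

#include <algorithm>
#include <cstdint>
#include <vector>

struct QuantileState {
  std::vector<double> buffer;

  // Called once per chunk; the total size is unknown, so we can only append.
  void Consume(const double* values, int64_t length) {
    buffer.insert(buffer.end(), values, values + length);
  }

  void Merge(const QuantileState& other) {
    buffer.insert(buffer.end(), other.buffer.begin(), other.buffer.end());
  }

  // All real work happens here, over the concatenated input (assumed
  // non-empty), e.g. via a selection algorithm.
  double Finalize(double q) {
    auto nth = buffer.begin() + static_cast<int64_t>(q * (buffer.size() - 1));
    std::nth_element(buffer.begin(), nth, buffer.end());
    return *nth;
  }
};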


Re: [C++][Compute] question about aggregate kernels

2020-09-21 Thread Yibo Cai

I appreciate the help from everyone. It looks like the Arrow C++ aggregate kernel 
already addresses my problem of combining states from batches. Thanks.

On 9/18/20 12:08 PM, Micah Kornfield wrote:


Interestingly, spark uses count / N
<https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
 to
compute the average, not an online algorithm.


Yes, it looks like the actual Spark SQL code is at [1] though.  Spark
doesn't seem use the naive algorithm for Std. Dev. [2]


[1]
https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
[2]
https://github.com/apache/spark/blob/e42dbe7cd41c3689910165458a92b75c02e70a03/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L105

On Thu, Sep 17, 2020 at 8:56 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:


Hi,


I think what everyone else was potentially  stating implicitly is that

for
combining details about arrays, for std. dev. and average there needs to be
more state kept that is different from the elements that one is actually
dealing with.  For std. dev.  you need to keep two numbers (same with
average).

This; Irrespectively of which algorithm we compute an aggregate with, the
core idea is that we split the calculation in batches, and we need to be
able to reduce each batch to a set of states (e.g. N, count), so that we
can reduce these states to a single state, which can be used to compute the
final result.

Also related: https://issues.apache.org/jira/browse/ARROW-9779

Interestingly, spark uses count / N
<https://github.com/apache/spark/blob/59eb34b82c023ac56dcd08a4ceccdf612bfa7f29/examples/src/main/scala/org/apache/spark/examples/sql/SimpleTypedAggregator.scala#L83>
to compute the average, not an online algorithm.

Best,
Jorge


On Fri, Sep 18, 2020 at 5:42 AM Andrew Wieteska <
andrew.r.wiete...@gmail.com> wrote:


Dear all

I'm not sure I'm thinking about this right, but if we're looking to
leverage vectorization for standard deviation/variance would it make sense
to compute the sum, the sum of squares, and the total number of data (N)
over all chunks and compute the actual function,

stdev = sqrt(sum_squares/N - (sum/N)^2)

only once at the end? This is one of the approaches in [1].

Best wishes
Andrew
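
As a rough sketch (hypothetical types, not the actual Arrow kernels), this
maps cleanly onto the consume/merge pattern, since each component of the
state is associative; note Micah's caveat below that this shifted-data
formula is not numerically stable.

#include <cmath>
#include <cstdint>

struct MomentState {
  int64_t n = 0;
  double sum = 0.0;
  double sum_sq = 0.0;

  void Consume(const double* values, int64_t length) {
    for (int64_t i = 0; i < length; ++i) {
      sum += values[i];
      sum_sq += values[i] * values[i];
    }
    n += length;
  }

  // Chunks can be consumed independently and merged in any order.
  void Merge(const MomentState& other) {
    n += other.n;
    sum += other.sum;
    sum_sq += other.sum_sq;
  }

  // stdev = sqrt(sum_squares/N - (sum/N)^2), computed once at the end.
  double Finalize() const {
    double mean = sum / n;
    return std::sqrt(sum_sq / n - mean * mean);
  }
};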

On Thu, Sep 17, 2020 at 11:29 PM Micah Kornfield 
wrote:



stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x))/(count(x)-1))



This is not numerically stable. Please do not use it.  Please see [1]

for

some algorithms that might be better.

The equation you provided is great in practice to calculate stdev for

one

array. It doesn't address the issue of combining stdev from multiple

arrays.


I think what everyone else was potentially  stating implicitly is that

for

combining details about arrays, for std. dev. and average there needs

to be

more state kept that is different from the elements that one is actually
dealing with.  For std. dev.  you need to keep two numbers (same with
average).

For percentiles, I think calculating exactly will require quite a large
state (for integers a histogram approach could be used to compress

this).

There are however some good approximation algorithms that can be used if
exact values are not necessary (for example t-digest [2]).  At some

point

Arrow should probably have both.

[1]



https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data

[2] https://github.com/tdunning/t-digest

On Thu, Sep 17, 2020 at 8:17 PM Yibo Cai  wrote:


Thanks Andrew. The link gives a cool method to calculate variance
incrementally. I think the problem is that it's computationally too
expensive (cannot leverage vectorization, three divisions for a single

data

point).
The equation you provided is great in practice to calculate stdev for

one

array. It doesn't address the issue of combining stdev from multiple

arrays.


On 9/16/20 6:25 PM, Andrew Lamb wrote:

Perhaps you can rewrite the functions in terms of other kernels that

can

be

merged -- for example something like the following

stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x))/(count(x)-1))


(loosely translated from






https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation

)

On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai  wrote:


Hi,

I have a question about aggregate kernel implementation. Any help

is

appreciated.

Aggregate kernel implements "consume" and "merge" interfaces. For a
chunked array, "consume" is called for each array to get a temporary
aggregated result, then "merge" it with previously consumed result. For
associative operations like min/max/sum, this pattern is convenient. We can
easily "merge" min/max/sum of two arrays, e.g, sum([array_a, array_b]) =
sum(array_a) + sum(array_b).

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Yibo Cai

Thanks Jorge, Wes.
I will study the links and try to propose improvements to the C++ aggregate functions.

On 9/16/20 11:17 PM, Wes McKinney wrote:

Perhaps it would be helpful to look at how Clickhouse's aggregate
functions are implemented?

https://github.com/ClickHouse/ClickHouse/tree/master/src/AggregateFunctions

You're welcome to propose improvements to the kernel interface to
accommodate more complex aggregate functions

On Wed, Sep 16, 2020 at 5:27 AM Jorge Cardoso Leitão
 wrote:


Hi Yibo,

That is correct. The simplest example is an average of 3 elements
{x1,x2,x3}, in two chunks: {x1} and {x2,x3}. The average of the average is
not equal to the average:

avg({avg({x1}), avg({x2,x3})}) = ((x1) + (x2 + x3)/2)/2 != (x1 + x2 + x3) /
3 = avg({x1,x2,x3})

We are solving this in DataFusion (Rust):

Issue: https://issues.apache.org/jira/browse/ARROW-9937
Proposal:
https://docs.google.com/document/d/1n-GS103ih3QIeQMbf_zyDStjUmryRQd45ypgk884LHU/edit
PR https://github.com/apache/arrow/pull/8172,

Essentially, what you said, there needs to be two operations, `update` and
`merge`, which are not the same.
There are many other examples, and that is also why e.g. spark's UDAF's
interface requires `update` and `merge`
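
A tiny sketch of that distinction (hypothetical names): `update` folds raw
values into a partial state, `merge` combines partial states, and only
`finalize` divides.

#include <cstdint>

struct AvgState {
  double sum = 0.0;
  int64_t count = 0;

  void Update(double x) { sum += x; count += 1; }                    // per value
  void Merge(const AvgState& o) { sum += o.sum; count += o.count; }  // per state
  double Finalize() const { return sum / count; }
};

// With chunks {x1} and {x2, x3}, merging the two states yields
// (x1 + x2 + x3) / 3, unlike averaging the two chunk averages.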




On Wed, Sep 16, 2020 at 12:12 PM Yibo Cai  wrote:


Hi,

I have a question about aggregate kernel implementation. Any help is
appreciated.

Aggregate kernel implements "consume" and "merge" interfaces. For a
chunked array, "consume" is called for each array to get a temporary
aggregated result, then "merge" it with previously consumed result. For
associative operations like min/max/sum, this pattern is convenient. We can
easily "merge" min/max/sum of two arrays, e.g, sum([array_a, array_b]) =
sum(array_a) + sum(array_b).

But I wonder what's the best approach to deal with operations like
stdev/percentile. Results of these operations cannot be easily "merged". We
have to walk through all the chunks to get the result. For these
operations, it looks like "consume" must copy the input array and do all
the calculation once at "finalize" time. Or maybe we don't expect chunked
array support for them.

Yibo



Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Yibo Cai

Thanks Andrew. The link gives a cool method to calculate variance 
incrementally. I think the problem is that it's computationally too expensive 
(cannot leverage vectorization, three divisions for a single data point).
The equation you provided is great in practice to calculate stdev for one 
array. It doesn't address the issue of combining stdev from multiple arrays.

On 9/16/20 6:25 PM, Andrew Lamb wrote:

Perhaps you can rewrite the functions in terms of other kernels that can be
merged -- for example something like the following

stddev(x) = sqrt((sum(x*x) - sum(x)*sum(x)/count(x))/(count(x)-1))

(loosely translated from
https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation
)

On Wed, Sep 16, 2020 at 6:12 AM Yibo Cai  wrote:


Hi,

I have a question about aggregate kernel implementation. Any help is
appreciated.

Aggregate kernel implements "consume" and "merge" interfaces. For a
chunked array, "consume" is called for each array to get a temporary
aggregated result, then "merge" it with previously consumed result. For
associative operations like min/max/sum, this pattern is convenient. We can
easily "merge" min/max/sum of two arrays, e.g, sum([array_a, array_b]) =
sum(array_a) + sum(array_b).

But I wonder what's the best approach to deal with operations like
stdev/percentile. Results of these operations cannot be easily "merged". We
have to walk through all the chunks to get the result. For these
operations, it looks like "consume" must copy the input array and do all
the calculation once at "finalize" time. Or maybe we don't expect chunked
array support for them.

Yibo





[C++][Compute] question about aggregate kernels

2020-09-16 Thread Yibo Cai

Hi,

I have a question about aggregate kernel implementation. Any help is 
appreciated.

Aggregate kernel implements "consume" and "merge" interfaces. For a chunked array, "consume" is 
called for each array to get a temporary aggregated result, then "merge" it with previously consumed result. For 
associative operations like min/max/sum, this pattern is convenient. We can easily "merge" min/max/sum of two arrays, 
e.g, sum([array_a, array_b]) = sum(array_a) + sum(array_b).

But I wonder what's the best approach to deal with operations like stdev/percentile. Results of these 
operations cannot be easily "merged". We have to walk through all the chunks to get the result. For 
these operations, it looks like "consume" must copy the input array and do all the calculation once at 
"finalize" time. Or maybe we don't expect chunked array support for them.

Yibo


Re: [DISCUSS][C++] Performance work and compiler standardization for linux

2020-06-22 Thread Yibo Cai

On 6/22/20 5:07 PM, Antoine Pitrou wrote:


Le 22/06/2020 à 06:27, Micah Kornfield a écrit :

There has been significant effort recently trying to optimize our C++
code.  One  thing that seems to come up frequently is different benchmark
results between GCC and Clang.  Even different versions of the same
compiler can yield significantly different results on the same code.

I would like to propose that we choose a specific compiler and version on
Linux for evaluating performance related PRs.  PRs would only be accepted
if they improve the benchmarks under the selected version.


Would this be a hard rule or just a guideline?  There are many ways in
which benchmark numbers can be improved or deteriorated by a PR, and in
some cases that doesn't matter (benchmarks are not always realistic, and
they are not representative of every workload).



I agree that microbenchmarks are not always useful; focusing too much on
improving microbenchmark results gives me a feeling of "overfitting" (to some
specific microarchitecture, compiler, or use case).


Regards

Antoine.



Re: Flight benchmark question

2020-06-17 Thread Yibo Cai

On 6/17/20 8:33 PM, David Li wrote:


Hey Yibo,

Thanks for investigating this! This is a great writeup.

There was a PR recently to let clients set gRPC options like this, so
it can be enabled on a case-by-case basis:
https://github.com/apache/arrow/pull/7406
So we could add that to the benchmark or suggest it in documentation.


Thanks David, that's exactly what I want.



I think this benchmark is a bit of a pathological case for gRPC. gRPC
will share sockets when all client options are exactly the same; it
seems just adding TLS, for instance, would break that (unless you
intentionally shared TLS credentials, which Flight doesn't):
https://github.com/grpc/grpc/issues/15207. I believe grpc-java doesn't
have this behavior (different Channel instances won't share
connections).

Also, did you investigate the SO_ZEROCOPY flag gRPC now offers? I
wonder if that might also help performance a bit.
https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga1eb58c302eaf27a5d982b30402b8f84a


I did a quick try. Ubuntu's 4.15 kernel doesn't support SO_ZEROCOPY, so I upgraded 
to a 5.7 kernel.
Per my test on the same host, there is no obvious difference after setting this 
option. I will do more tests over the network.



Best,
David

On 6/17/20, Chengxin Ma  wrote:

Hi Yibo,


Your discovery is impressive.


Did you consider the `num_streams` parameter [1] as well? If I understood
correctly, this parameter is used for setting the conceptual concurrent
streams between the client and the server, while `num_threads` is used for
setting the size of the thread pool that actually handles these streams [2].
By default, both of the two parameters are 4.


As for CPU usage, the parameter `records_per_batch`[3] has an impact as
well. If you increase the value of this parameter, you will probably see
that the data transfer speed increased while the server-side CPU usage
dropped [4].
My guess is that as more records are put in one record batch, the total
number of batches would decrease. CPU is only used for (de)serializing the
metadata (i.e. schema) of each record batch while the payload can be
transferred with zero cost [5].


[1] 
https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
[2] 
https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
[3] 
https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
[4] 
https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
[5] See "Optimizing Data Throughput over gRPC"
in https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/


Kind Regards
Chengxin


Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On Wednesday, June 17, 2020 8:35 AM, Yibo Cai  wrote:


I found a way to achieve reasonable benchmark results with multiple
threads. The diff is pasted below for a quick review or try.
Tested on E5-2650, with this change:
num_threads = 1, speed = 1996
num_threads = 2, speed = 3555
num_threads = 4, speed = 5828

When running `arrow_flight_benchmark`, I find there's only one TCP
connection between client and server, no matter what `num_threads` is. All
clients share one TCP connection. At the server side, I see only one thread
processing network packets. On my machine, one client already saturates a
CPU core, so it becomes worse when `num_threads` increases, as that single
server thread becomes the bottleneck.

If running in standalone mode, flight clients are from different processes
and have their own TCP connections to the server. There are separate
server threads handling network traffic for each connection, without a
central bottleneck.

I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL[1] just
before giving up. Setting that arg makes each client establish its own TCP
connection to the server, similar to standalone mode.

Actually, I'm not quite sure if we should set this arg. Sharing one TCP
connection is a reasonable configuration, and it's an advantage of
gRPC[2].

Per my test, most CPU cycles are spent in kernel mode doing networking and
data transfer. Maybe a better solution is to leverage modern network
techniques like RDMA or a user-mode stack for higher performance.

[1]
https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc

Re: Flight benchmark question

2020-06-17 Thread Yibo Cai

I found a way to achieve reasonable benchmark results with multiple threads. The 
diff is pasted below for a quick review or try.
Tested on E5-2650, with this change:
num_threads = 1, speed = 1996
num_threads = 2, speed = 3555
num_threads = 4, speed = 5828

When running `arrow_flight_benchmark`, I find there's only one TCP connection 
between client and server, no matter what `num_threads` is. All clients share 
one TCP connection. At the server side, I see only one thread processing network 
packets. On my machine, one client already saturates a CPU core, so it becomes 
worse when `num_threads` increases, as that single server thread becomes the 
bottleneck.

If running in standalone mode, flight clients are from different processes and 
have their own TCP connections to the server. There are separate server threads 
handling network traffic for each connection, without a central bottleneck.

I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL[1] just before 
giving up. Setting that arg makes each client establish its own TCP connection 
to the server, similar to standalone mode.

Actually, I'm not quite sure if we should set this arg. Sharing one TCP 
connection is a reasonable configuration, and it's an advantage of gRPC[2].

Per my test, most CPU cycles are spent in kernel mode doing networking and data 
transfer. Maybe a better solution is to leverage modern network techniques like 
RDMA or a user-mode stack for higher performance.

[1] 
https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
[2] https://platformlab.stanford.edu/Seminar%20Talks/gRPC.pdf, page5


diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
index d530093d9..6904640d3 100644
--- a/cpp/src/arrow/flight/client.cc
+++ b/cpp/src/arrow/flight/client.cc
@@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
     args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
     // Receive messages of any size
     args.SetMaxReceiveMessageSize(-1);
+    // Setting this arg enables each client to open its own TCP connection to the server,
+    // not sharing one single connection, which becomes a bottleneck under high load.
+    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
 
     if (options.override_hostname != "") {
       args.SetSslTargetNameOverride(options.override_hostname);


On 6/15/20 10:00 PM, Wes McKinney wrote:

On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou  wrote:



Le 15/06/2020 à 15:36, Wes McKinney a écrit :


When you have only a single server, all the gRPC traffic goes through
a common port and is handled by a common server, so if both client and
server are roughly IO bound you aren't going to get better performance
by hitting the server with multiple clients simultaneously, only worse
because the packets from different client requests are intermingled in
the TCP traffic on that port. I'm not a networking expert but this is
my best understanding of what is going on.


Yibo Cai's experiment disproves that explanation, though.

When I run a single client against the test server, I get ~4 GB/s.  When
I run 6 standalone clients against the *same* test server, I get ~8 GB/s
aggregate.  So there's something else going on that limits scalability
when the benchmark executable runs all clients by itself (perhaps gRPC
clients in a single process share some underlying structure or execution
threads? I don't know).



I see, thanks. OK then clearly something else is going on.


I hope someone will implement the "multiple test servers" TODO in the
benchmark.


I think that's a bad idea *in any case*, as running multiple servers on
different ports is not a realistic expectation from users.

Regards

Antoine.


Flight benchmark question

2020-06-15 Thread Yibo Cai

I'm evaluating the flight benchmark [1] on a single host. I met one problem and 
would like to seek help.

Flight benchmark has a "num_threads" parameter [1] to set the "number of concurrent gets". 
Counter-intuitively, setting it to larger values drops performance: "arrow-flight-benchmark --num_threads=1" 
performs much better than "arrow-flight-benchmark --num_threads=2". There's a historical thread about 
this issue [2], which explains that it's better to spawn more servers on different ports than to have all threads hit a 
single server app.

I did another test with a standalone server; the result is different.

1. spawn a standalone flight server
   $ ./arrow-flight-perf-server
   Server host: localhost
   Server port: 31337

2. test one flight benchmark to get baseline performance
   $ ./arrow-flight-benchmark --num_threads 1 --server_host localhost 
--records_per_stream=123456789
   
   Speed: 4717.28 MB/s

3. test two flight benchmarks concurrently, check scalability
   # run in one console
   $ ./arrow-flight-benchmark --num_threads 1 --server_host localhost 
--records_per_stream=123456789
   
   Speed: 4160.94 MB/s

   # run at *same time* in another console
   $ ./arrow-flight-benchmark --num_threads 1 --server_host localhost 
--records_per_stream=123456789
   
   Speed: 4154.65 MB/s

From this result, it looks like the flight server has good multi-core scalability. The 
same behaviour is observed when tested across the network.
What's the difference between the above two tests, with and without a standalone server?

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight_benchmark.cc#L44
[2] 
https://lists.apache.org/thread.html/rd2aa01f460dd1092c60d1ba75087c2ce87c81ac543a246549b4713fb%40%3Cdev.arrow.apache.org%3E

Yibo


Re: [C++][Discuss] Approaches for SIMD optimizations

2020-06-12 Thread Yibo Cai

On 6/12/20 2:30 PM, Micah Kornfield wrote:

Hi Frank,
Are the performance numbers you published for the baseline directly from
master?  I'd like to look at this over the next few days to see if I can
figure out what is going on.

To all:
I'd like to make sure we flush out things to consider in general, for a
path forward.

My take on this is we should still prefer writing code in this order:
1.  Plain-old C++
2.  SIMD Wrapper library (my preference would be towards something that is
going to be standardized eventually to limit 3P dependencies.  I think the
counter argument here is if any of the libraries mentioned above has much
better feature coverage on advanced instruction sets).  Please chime in if
there are other things to consider.  We should have some rubrics for when
to make use of the library (i.e. what performance gain do we get on a
workload).
3.  Native CPU intrinsics.  We should develop a rubric for when to accept
PRs for this.  This should include:
1.  Performance gain.
2.  General popularity of the architecture.

For dynamic dispatch:
I think we should probably continue down the path of building our own.  I
looked more at libsimdpp's implementation and it might be something we can
use for guidance, but as it stands, it doesn't seem to have hooks based on
CPU manufacturer, which for BMI2 intrinsics would be a requirement.  The
alternative would be to ban BMI2 intrinsics from the code (this might not
be a bad idea to limit complexity in general).

Thoughts?



I think it's a good path forward. Thanks Micah.


Thanks,
Micah









On Wed, Jun 10, 2020 at 8:35 PM Du, Frank  wrote:


Thanks Jed.

I collected some data on my setup: gcc 7.5.0, Ubuntu 18.04.4 LTS, SSE
build (-msse4.2)

[Unroll baseline]
for (int64_t i = 0; i < length_rounded; i += kRoundFactor) {
  for (int64_t k = 0; k < kRoundFactor; k++) {
    sum_rounded[k] += values[i + k];
  }
}
SumKernelFloat/32768/0     2.91 us    2.90 us   239992   bytes_per_second=10.5063G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0    1.89 us    1.89 us   374470   bytes_per_second=16.1847G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0      11.6 us    11.6 us    60329   bytes_per_second=2.63274G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0     6.98 us    6.98 us   100293   bytes_per_second=4.3737G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0     3.89 us    3.88 us   180423   bytes_per_second=7.85862G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0     1.86 us    1.85 us   380477   bytes_per_second=16.4536G/s null_percent=0 size=32.768k

[#pragma omp simd reduction(+:sum)]
#pragma omp simd reduction(+:sum)
for (int64_t i = 0; i < n; i++)
  sum += values[i];
SumKernelFloat/32768/0     2.97 us    2.96 us   235686   bytes_per_second=10.294G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0    2.97 us    2.97 us   236456   bytes_per_second=10.2875G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0      11.7 us    11.7 us    60006   bytes_per_second=2.61643G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0     5.47 us    5.47 us   127999   bytes_per_second=5.58002G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0     2.42 us    2.41 us   290635   bytes_per_second=12.6485G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0     1.82 us    1.82 us   386749   bytes_per_second=16.7733G/s null_percent=0 size=32.768k

[SSE intrinsic]
SumKernelFloat/32768/0     2.24 us    2.24 us   310914   bytes_per_second=13.6335G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0    1.43 us    1.43 us   486642   bytes_per_second=21.3266G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0      6.93 us    6.92 us   100720   bytes_per_second=4.41046G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0     3.14 us    3.14 us   222803   bytes_per_second=9.72931G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0     2.11 us    2.11 us   331388   bytes_per_second=14.4907G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0     1.32 us    1.32 us   532964   bytes_per_second=23.0728G/s null_percent=0 size=32.768k

I tried tweaking kRoundFactor, using some unroll-based omp simd, and building 
with clang-8; unluckily I could never get the results up to the intrinsic version. 
The generated ASM code all uses SIMD instructions, with only some small 
differences like instruction sequences or which xmm registers are used. What the 
compiler does underneath is really a mystery to me.

Thanks,
Frank

-Original Message-
From: Jed Brown 
Sent: Thursday, June 11, 2020 1:58 AM
To: Du, Frank ; dev@arrow.apache.org
Subject: RE: [C++][Discuss] Approaches for SIMD optimizations

"Du, Frank"  writes:


The PR I committed provides basic support for runtime dispatching. I
agree that the compiler should generate good vectorization for the non-null
data part 

Re: [C++][Discuss] Approaches for SIMD optimizations

2020-06-10 Thread Yibo Cai
-written approach can be better
on a single {routine, instruction set} pair, it may lead to a globally
suboptimal situation (that is, unless the number of full-time developers
and maintainers on Arrow C++ inflates significantly).

Personally, I would like interested developers and contributors (such as
Micah, Frank, Yibo Cai) to hash out the various possible approaches, and
propose a way forward (which may be hybrid).

Regards

Antoine.



[jira] [Created] (ARROW-9038) [C++] Improve BitBlockCounter

2020-06-04 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-9038:
---

 Summary: [C++] Improve BitBlockCounter
 Key: ARROW-9038
 URL: https://issues.apache.org/jira/browse/ARROW-9038
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai


ARROW-9029 implements BitBlockCounter. There are chances to improve popcount 
performance per this review comment: 
https://github.com/apache/arrow/pull/7346#discussion_r435005226



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8979) [C++] Implement bitmap word reader and writer

2020-05-28 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8979:
---

 Summary: [C++] Implement bitmap word reader and writer
 Key: ARROW-8979
 URL: https://issues.apache.org/jira/browse/ARROW-8979
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


The three Jira tasks below optimize the unaligned case of bitmap operations 
(logical, copy, compare, etc). They use a word-by-word approach instead of 
bit-by-bit to improve performance.
 There is some common code for reading/writing bitmaps in words. It's better to 
implement a word-based bitmap reader and writer to wrap the similar functionality 
and reduce code redundancy. A minimal sketch of the core idea follows the links 
below.
 https://issues.apache.org/jira/browse/ARROW-8553
 https://issues.apache.org/jira/browse/ARROW-8843
 https://issues.apache.org/jira/browse/ARROW-8844
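A minimal sketch of the word-based access (illustration only, not the proposed 
reader/writer API), assuming little-endian LSB-first bit order and at least 9 
readable bytes at the load position:
{code:cpp}
#include <cstdint>
#include <cstring>

// Return 64 bits of `bitmap` starting at bit `bit_offset`, so the hot loop
// can process 64 bits per iteration instead of one.
uint64_t ReadWord(const uint8_t* bitmap, int64_t bit_offset) {
  uint64_t word;
  std::memcpy(&word, bitmap + bit_offset / 8, sizeof(word));
  const int shift = static_cast<int>(bit_offset % 8);
  if (shift == 0) return word;
  // Pull the remaining `shift` bits from the 9th byte.
  const uint64_t next = bitmap[bit_offset / 8 + 8];
  return (word >> shift) | (next << (64 - shift));
}
{code}
A writer would do the reverse (merge a word into the destination with masks); 
this shared loop structure is what the reader/writer classes would wrap.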



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8974) [C++] Refine TransferBitmap template parameters

2020-05-27 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8974:
---

 Summary: [C++] Refine TransferBitmap template parameters
 Key: ARROW-8974
 URL: https://issues.apache.org/jira/browse/ARROW-8974
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


[TransferBitmap|https://github.com/apache/arrow/blob/44e723d9ac7c64739d419ad66618d2d56003d1b7/cpp/src/arrow/util/bit_util.cc#L110]
 has two template parameters of bool type, giving four combinations.
Changing them to function parameters can reduce code size. I think 
"restore_trailing_bits" cannot impact performance; "invert_bits" needs a 
benchmark.
Also, a bool parameter is hard to figure out at the [caller 
side|https://github.com/apache/arrow/blob/44e723d9ac7c64739d419ad66618d2d56003d1b7/cpp/src/arrow/util/bit_util.cc#L208];
 it's better to use meaningful definitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8844) [C++] Optimize TransferBitmap unaligned case

2020-05-18 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8844:
---

 Summary: [C++] Optimize TransferBitmap unaligned case
 Key: ARROW-8844
 URL: https://issues.apache.org/jira/browse/ARROW-8844
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


The TransferBitmap (CopyBitmap, InvertBitmap) unaligned case is processed 
bit-by-bit[1]. A similar trick from this PR[2] may also be helpful here to improve 
performance by processing in words.

[1] 
https://github.com/apache/arrow/blob/e5a33f1220705aec6a224b55d2a6f47fbd957603/cpp/src/arrow/util/bit_util.cc#L121-L134
[2] https://github.com/apache/arrow/pull/7135



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8843) [C++] Optimize BitmapEquals unaligned case

2020-05-18 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8843:
---

 Summary: [C++] Optimize BitmapEquals unaligned case
 Key: ARROW-8843
 URL: https://issues.apache.org/jira/browse/ARROW-8843
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


The BitmapEquals unaligned case compares two bitmaps bit-by-bit[1]. Similar tricks 
from this PR[2] may also be helpful here to improve performance by processing in 
words.

[1] 
https://github.com/apache/arrow/blob/e5a33f1220705aec6a224b55d2a6f47fbd957603/cpp/src/arrow/util/bit_util.cc#L248-L254
[2] https://github.com/apache/arrow/pull/7135



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [C++] Runtime SIMD dispatching for Arrow

2020-05-12 Thread Yibo Cai

Thanks Wes, I'm glad to see this feature coming.

From past discussions, the main concern is that a runtime dispatcher may cause 
performance issues.
Personally, I don't think it's a big problem: if we're using SIMD, it must be 
targeting some time-consuming code.

But we do need to take care of some issues. E.g., I see code like this:
for (int i = 0; i < n; ++i) {
  simd_code();
}
With a runtime dispatcher, this becomes an indirect function call in each iteration.
We should change the code to move the loop inside simd_code(), as in the sketch below.
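A minimal sketch of what I mean (names are made up for illustration, not actual 
Arrow code):

#include <cstdint>

namespace {

// The dispatched variant contains the loop itself, so the indirect call
// through the function pointer happens once per batch, not per element.
void AddOneBatchScalar(const int64_t* in, int64_t* out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) out[i] = in[i] + 1;
}

// In real code this would be rebound to an SSE/AVX/NEON variant based on the
// running cpu; here it just points at the scalar version.
void (*add_one_batch)(const int64_t*, int64_t*, int64_t) = AddOneBatchScalar;

}  // namespace

void AddOne(const int64_t* in, int64_t* out, int64_t n) {
  // Good: one indirect call for the whole loop.
  add_one_batch(in, out, n);
}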

It would be better if you could consider architectures other than x86 (at the 
framework level).
Ignore this if it costs too much effort. We can always improve later.

Yibo

On 5/13/20 9:46 AM, Wes McKinney wrote:

hi,

We've started to receive a number of patches providing SIMD operations
for both x86 and ARM architectures. Most of these patches make use of
compiler definitions to toggle between code paths at compile time.

This is problematic for a few reasons:

* Binaries that are shipped (e.g. in Python) must generally be
compiled for a broad set of supported compilers. That means that AVX2
/ AVX512 optimizations won't be available in these builds for
processors that have them
* Poses a maintainability and testing problem (hard to test every
combination, and it is not practical for local development to compile
every combination, which may cause drawn out test/CI/fix cycles)

Other projects (e.g. NumPy) have taken the approach of building
binaries that contain multiple variants of a function with different
levels of SIMD, and then choosing at runtime which one to execute
based on what features the CPU supports. This seems like what we
ultimately need to do in Apache Arrow, and if we continue to accept
patches that do not do this, it will be much more work later when we
have to refactor things to runtime dispatching.

We have some PRs in the queue related to SIMD. Without taking a heavy
handed approach like starting to veto PRs, how would everyone like to
begin to address the runtime dispatching problem?

Note that the Kernels revamp project I am working on right now will
also facilitate runtime SIMD kernel dispatching for array expression
evaluation.

Thanks,
Wes



[jira] [Created] (ARROW-8728) [C++] Bitmap operation may cause buffer overflow

2020-05-07 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8728:
---

 Summary: [C++]  Bitmap operation may cause buffer overflow
 Key: ARROW-8728
 URL: https://issues.apache.org/jira/browse/ARROW-8728
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


Happened to find this issue when refining bitmap operations: [this 
code|https://github.com/apache/arrow/blob/9b75a60658327c39383bee48fa6e5827faf2ced3/cpp/src/arrow/util/bit_util.cc#L267]
 may overflow the buffer. It should be "(length + left_offset % 8)".
Improve the unit test so that large offset values can trigger the bug.
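For illustration, the byte count in question (a hypothetical helper, not the 
actual Arrow code):
{code:cpp}
#include <cstdint>

// Bytes covered when touching `length` bits starting at bit `left_offset`,
// after the whole bytes (left_offset / 8) have already been skipped.
// Using `length + left_offset` here instead of `length + left_offset % 8`
// over-counts and can run past the end of the buffer for large offsets.
int64_t CoveredBytes(int64_t left_offset, int64_t length) {
  return (length + left_offset % 8 + 7) / 8;  // ceiling to whole bytes
}
{code}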



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8537) [C++] Performance regression from ARROW-8523

2020-04-20 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8537:
---

 Summary: [C++] Performance regression from ARROW-8523
 Key: ARROW-8537
 URL: https://issues.apache.org/jira/browse/ARROW-8537
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Yibo Cai


I optimized BitmapReader in [this PR|https://github.com/apache/arrow/pull/6986] 
and saw a performance uplift in the BitmapReader test case. I didn't check other 
test cases as the change looked trivial.
I reviewed all test cases just now and see a big performance drop in 4 cases, 
details at the [PR 
link|https://github.com/apache/arrow/pull/6986#issuecomment-616915079].
I also compared the performance of code using BitmapReader; no obvious changes 
were found. Looks like we should revert that PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8523) [C++] Optimize BitmapReader

2020-04-19 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8523:
---

 Summary: [C++] Optimize BitmapReader
 Key: ARROW-8523
 URL: https://issues.apache.org/jira/browse/ARROW-8523
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8496) [C++] Refine ByteStreamSplitDecodeScalar

2020-04-17 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8496:
---

 Summary: [C++] Refine ByteStreamSplitDecodeScalar
 Key: ARROW-8496
 URL: https://issues.apache.org/jira/browse/ARROW-8496
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8440) Refine simd header files

2020-04-14 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8440:
---

 Summary: Refine simd header files
 Key: ARROW-8440
 URL: https://issues.apache.org/jira/browse/ARROW-8440
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


This is a follow-up of ARROW-8227. It aims to unify simd header files and 
simplify code.
Currently, sse header files are included in sse_util.h, neon header files in 
neon_util.h, and avx header files are included directly in C++ source files.  
sse_util.h/neon_util.h also contain crc code which is not used by the cpp files 
that #include them.
It may be better to put all simd header files in a single simd.h, and move the 
crc code to where it is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8438) [C++] arrow-io-memory-benchmark crashes

2020-04-14 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8438:
---

 Summary: [C++] arrow-io-memory-benchmark crashes
 Key: ARROW-8438
 URL: https://issues.apache.org/jira/browse/ARROW-8438
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yibo Cai


"arrow-io-memory-benchmark" SIGSEGV in latest code base. It worked at least 
when my last commit 8 days ago: b1d4c86eb28267525c52f436c3a096e70b8ef6e0

stack backtrace attached

(gdb) r
Starting program: /home/cyb/share/debug/arrow-io-memory-benchmark 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
(gdb) [New Thread 0x737ff700 (LWP 29065)]
2020-04-14 14:24:40
Running /home/cyb/share/debug/arrow-io-memory-benchmark
Run on (32 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x16)
  L1 Instruction 64K (x16)
  L2 Unified 512K (x16)
  L3 Unified 4096K (x16)
Load Average: 2.64, 4.39, 4.28
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may 
be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.

Thread 1 "arrow-io-memory" received signal SIGSEGV, Segmentation fault.
0x768e67c8 in arrow::Buffer::is_mutable (this=0x0) at 
../src/arrow/buffer.h:258
258 ../src/arrow/buffer.h: No such file or directory.
(gdb) bt
#0  0x768e67c8 in arrow::Buffer::is_mutable (this=0x0) at 
../src/arrow/buffer.h:258
#1  0x76c3c41a in 
arrow::io::FixedSizeBufferWriter::FixedSizeBufferWriterImpl::FixedSizeBufferWriterImpl
 (this=0x558921f0, buffer=std::shared_ptr (empty) = {...})
at ../src/arrow/io/memory.cc:164
#2  0x76c3a575 in 
arrow::io::FixedSizeBufferWriter::FixedSizeBufferWriter (this=0x7fffd660, 
buffer=std::shared_ptr (empty) = {...}, __in_chrg=, 
__vtt_parm=) at ../src/arrow/io/memory.cc:227
#3  0x555ebd00 in arrow::ParallelMemoryCopy (state=...) at 
../src/arrow/io/memory_benchmark.cc:303
#4  0x555f80d4 in benchmark::internal::FunctionBenchmark::Run 
(this=0x55891290, st=...)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_register.cc:496
#5  0x5564bcc7 in benchmark::internal::BenchmarkInstance::Run 
(this=0x558939c0, iters=10, thread_id=0, timer=0x7fffd7a0, 
manager=0x55894b70)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_api_internal.cc:10
#6  0x5562c0c8 in benchmark::internal::(anonymous 
namespace)::RunInThread (b=0x558939c0, iters=10, thread_id=0, 
manager=0x55894b70)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:119
#7  0x5562c95a in benchmark::internal::(anonymous 
namespace)::BenchmarkRunner::DoNIterations (this=0x7fffddc0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:214
#8  0x5562d0ac in benchmark::internal::(anonymous 
namespace)::BenchmarkRunner::DoOneRepetition (this=0x7fffddc0, 
repetition_index=0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:299
#9  0x5562c558 in benchmark::internal::(anonymous 
namespace)::BenchmarkRunner::BenchmarkRunner (this=0x7fffddc0, b_=..., 
complexity_reports_=0x7fffdef0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:161
#10 0x5562d47f in benchmark::internal::RunBenchmark (b=..., 
complexity_reports=0x7fffdef0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:355
#11 0x555f0ae6 in benchmark::internal::(anonymous 
namespace)::RunBenchmarks (benchmarks=std::vector of length 9, capacity 12 = 
{...}, display_reporter=0x55891510, file_reporter=0x0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark.cc:265
#12 0x555f13b6 in benchmark::RunSpecifiedBenchmarks 
(display_reporter=0x55891510, file_reporter=0x0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark.cc:399
#13 0x555f0ef8 in benchmark::RunSpecifiedBenchmarks () at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark.cc:340
#14 0x555efc64 in main (argc=1, argv=0x7fffe398) at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_main.cc:17




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8238) [C++][Compute] Failed to build compute tests on windows with msvc2015

2020-03-27 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8238:
---

 Summary: [C++][Compute] Failed to build compute tests on windows 
with msvc2015
 Key: ARROW-8238
 URL: https://issues.apache.org/jira/browse/ARROW-8238
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Compute
Reporter: Yibo Cai


Build Arrow compute tests on Windows10 with MSVC2015:
{code:bash}
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_COMPUTE=ON 
-DARROW_BUILD_TESTS=ON ..

ninja -j3
{code}

Build failed with below message:
{code:bash}
[311/405] Linking CXX executable release\arrow-misc-test.exe
FAILED: release/arrow-misc-test.exe
cmd.exe /C "cd . && 
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\bin\cmake.exe -E 
vs_link_exe --intdir=src\arrow\CMakeFiles\arrow-misc-test.dir 
--rc=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\rc.exe 
--mt=C:\PROGRA~2\WI3CF2~1\8.1\bin\x64\mt.exe --manifests  -- 
C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj  
/out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
/pdb:release\arrow-misc-test.pdb /version:0.0  /machine:x64  
/NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console  
release\arrow_testing.lib  release\arrow.lib  
googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib  
googletest_ep-prefix\src\googletest_ep\lib\gtest.lib  
googletest_ep-prefix\src\googletest_ep\lib\gmock.lib  
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib  
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib  
Ws2_32.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib && cd ."
LINK: command "C:\PROGRA~2\MICROS~1.0\VC\bin\amd64\link.exe /nologo 
src\arrow\CMakeFiles\arrow-misc-test.dir\memory_pool_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\result_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\pretty_print_test.cc.obj 
src\arrow\CMakeFiles\arrow-misc-test.dir\status_test.cc.obj 
/out:release\arrow-misc-test.exe /implib:release\arrow-misc-test.lib 
/pdb:release\arrow-misc-test.pdb /version:0.0 /machine:x64 /NODEFAULTLIB:LIBCMT 
/INCREMENTAL:NO /subsystem:console release\arrow_testing.lib release\arrow.lib 
googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_filesystem.lib 
C:\Users\yibcai01\Miniconda3\envs\arrow-dev\Library\lib\boost_system.lib 
Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib 
oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
/MANIFESTFILE:release\arrow-misc-test.exe.manifest" failed (exit code 1169) 
with the following output:
arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
std::vector<int,std::allocator<int> >::vector<int,std::allocator<int> >(class 
std::initializer_list<int>,class std::allocator<int> const &)" 
(??0?$vector@HV?$allocator@H@std@@@std@@QEAA@V?$initializer_list@H@1@AEBV?$allocator@H@1@@Z)
 already defined in result_test.cc.obj
arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: __cdecl 
std::vector<int,std::allocator<int> >::~vector<int,std::allocator<int> >(void)" 
(??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ) already defined in result_test.cc.obj
arrow_testing.lib(arrow_testing.dll) : error LNK2005: "public: unsigned __int64 
__cdecl std::vector<int,std::allocator<int> >::size(void)const " 
(?size@?$vector@HV?$allocator@H@std@@@std@@QEBA_KXZ) already defined in 
result_test.cc.obj
release\arrow-misc-test.exe : fatal error LNK1169: one or more multiply defined 
symbols found
[313/405] Building CXX object 
src\arrow\CMakeFiles\arrow-table-test.dir\table_builder_test.cc.obj
ninja: build stopped: subcommand failed.
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8227) [C++] Propose refining SIMD code framework

2020-03-25 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8227:
---

 Summary: [C++] Propose refining SIMD code framework
 Key: ARROW-8227
 URL: https://issues.apache.org/jira/browse/ARROW-8227
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


Arrow supports a wide range of hardware(x86,arm,ppc?) + 
os(linux,windows,macos,others?) + compiler(gcc,clang,msvc,others?). Managing 
platform dependent code is non-trivial. This Jira aims to refine (or mess up) the 
simd related code framework.
Some goals:
- Move simd feature definitions into one place, possibly in cmake, and reduce compiler-based ifdefs in source code.
- Manage simd code in one place, but leave non-simd default implementations where they are.
- Shouldn't introduce any performance penalty; prefer direct inlining to a runtime dispatcher.
- Code should be easy to maintain and expand, and hard to make mistakes in.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [C++][Compute] RFC: add SIMD support to C++ kernel

2020-03-19 Thread Yibo Cai

Thanks Wes for quick response.
Yes, inlining can be a problem for a runtime dispatcher. It means we should
dispatch the whole loop[1], not the code inside the loop[2]. This may
create some traps for developers.

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bpacking.h#L3760
[2] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bpacking.h#L40

On 3/20/20 11:03 AM, Wes McKinney wrote:

hi Yibo,

I agree with this, having #ifdef in many places in the codebase is not
maintainable longer-term.

As far as runtime dispatch, we could populate a function table of all
machine-dependent functions once so then the dispatch isn't happening
on each function. Or some similar strategy

This of course presumes that functions with runtime SIMD dispatch do
not need to be inlined. For functions that need to be inlined, a
different approach may be required.

- Wes

On Thu, Mar 19, 2020 at 9:57 PM Yibo Cai  wrote:


I'm revisiting this old thread as I see some avx512 code merged recently[1].
Code maintenance will be non-trivial if we want to cover more 
hardware(sse/avx/avx512/neon/sve/...) and optimize more code in the future. 
#ifdef is obviously no-go.

So I'm selling my proposal again :)
- put all machine dependent code in one place (similar to what linux manages 
various cpu arches)
- add runtime dispatcher to select best simd code snippet per running hardware

I can provide a PR for community review first. Thoughts?

[1] https://github.com/apache/arrow/pull/6650

On 2019/12/24 18:17:25, Wes McKinney  wrote:

If we go the route of AOT-compilation of Gandiva kernels as an>
approach to generate a shared library with many kernels, we might>
indeed look at possibly generating a "fat" binary with runtime>
dispatch between AVX2-optimized vs. SSE <= 4.2 (or non-SIMD>
altogether) kernels. This is something we could do during the code>
generation step where we generate the "stubs" to invoke the IR>
kernels.>

Given where the project is at in its development trajectory, it seems>
important to come up with some concrete answers to some of these>
questions to reduce developer anxiety that may otherwise prevent>
forward progress in feature development.>

On Tue, Dec 24, 2019 at 2:37 AM Micah Kornfield  wrote:>



I would lean against adding another library dependency.  My main concerns>
with adding another library dependency are:>
1.  Supporting it across all of the build tool-chains (using a GCC specific>
option would be my least favorite approach).>
2.  Distributed binary size (for wheels at least people seem to care).>



I would lean more towards yes if there were some real world benchmarks>
showing a substantial performance gain.>



I don't think it is unreasonable to package our binaries targeting a common>
instruction set (e.g. AVX 1 or 2).  For those that want to make full use of>
their latest hardware compiling from source doesn't seem unreasonable,>
especially given the recent effort to trim dependencies.>



Cheers,>
Micah>





On Fri, Dec 20, 2019 at 2:13 AM Antoine Pitrou  wrote:>





Hi,>



I would recommend against reinventing the wheel.  It would be possible>
to reuse an existing C++ SIMD library.  There are several of them (Vc,>
xsimd, libsimdpp...).  Of course, "just use Gandiva" is another possible>
answer.>



Regards>



Antoine.>




Le 20/12/2019 à 08:32, Yibo Cai a écrit :>

Hi,>



I'm investigating SIMD support to C++ compute kernel(not gandiva).>



A typical case is the sum kernel[1]. Below tight loop can be easily>

optimized with SIMD.>



for (int64_t i = 0; i < length; i++) {>
local.sum += values[i];>
}>



Compiler already does loop vectorization. But it's done at compile time>

without knowledge of target cpu.>

Binaries compiled with avx-512 cannot run on old cpu, while binaries>

compiled with only sse4 enabled is suboptimal on new hardware.>



I have some proposals, would like to hear comments from community.>



- Based on our experience of ISA-L[2] project(optimized storage>

acceleration library for x86 and Arm), runtime dispatcher is a good>
approach. Basically, it links in codes optimized for different cpu>
features(sse4,avx2,neon,...) and selects the best one fits target cpu at>
first invocation. This is similar to gcc indirect function[3], but doesn't>
depend on compilers.>



- Use gcc FMV [4] to generate multiple binaries for one function. See>

sample source and compiled code [5].>

Though looks simple, it has many limitations: It's gcc specific>

feature, no support from clang and msvc. It only works on x86, no Arm>
support.>

I think this approach is no-go.>



- Don't do it.>
Gandiva leverages LLVM JIT for runtime code optimization. Is it>

duplicated effort to do it in C++ kernel? Will the

Re: [C++][Compute] RFC: add SIMD support to C++ kernel

2020-03-19 Thread Yibo Cai

I'm revisiting this old thread as I see some avx512 code merged recently[1].
Code maintenance will be non-trivial if we want to cover more 
hardware(sse/avx/avx512/neon/sve/...) and optimize more code in the future. 
#ifdef is obviously no-go.

So I'm selling my proposal again :)
- put all machine dependent code in one place (similar to what linux manages 
various cpu arches)
- add runtime dispatcher to select best simd code snippet per running hardware

I can provide a PR for community review first. Thoughts?

[1] https://github.com/apache/arrow/pull/6650

On 2019/12/24 18:17:25, Wes McKinney  wrote:
If we go the route of AOT-compilation of Gandiva kernels as an> 
approach to generate a shared library with many kernels, we might> 
indeed look at possibly generating a "fat" binary with runtime> 
dispatch between AVX2-optimized vs. SSE <= 4.2 (or non-SIMD> 
altogether) kernels. This is something we could do during the code> 
generation step where we generate the "stubs" to invoke the IR> 
kernels.> 

Given where the project is at in its development trajectory, it seems> 
important to come up with some concrete answers to some of these> 
questions to reduce developer anxiety that may otherwise prevent> 
forward progress in feature development.> 

On Tue, Dec 24, 2019 at 2:37 AM Micah Kornfield  wrote:> 
>> 
> I would lean against adding another library dependency.  My main concerns> 
> with adding another library dependency are:> 
> 1.  Supporting it across all of the build tool-chains (using a GCC specific> 
> option would be my least favorite approach).> 
> 2.  Distributed binary size (for wheels at least people seem to care).> 
>> 
> I would lean more towards yes if there were some real world benchmarks> 
> showing a substantial performance gain.> 
>> 
> I don't think it is unreasonable to package our binaries targeting a common> 
> instruction set (e.g. AVX 1 or 2).  For those that want to make full use of> 
> their latest hardware compiling from source doesn't seem unreasonable,> 
> especially given the recent effort to trim dependencies.> 
>> 
> Cheers,> 
> Micah> 
>> 
>> 
>> 
> On Fri, Dec 20, 2019 at 2:13 AM Antoine Pitrou  wrote:> 
>> 
> >> 
> > Hi,> 
> >> 
> > I would recommend against reinventing the wheel.  It would be possible> 
> > to reuse an existing C++ SIMD library.  There are several of them (Vc,> 
> > xsimd, libsimdpp...).  Of course, "just use Gandiva" is another possible> 
> > answer.> 
> >> 
> > Regards> 
> >> 
> > Antoine.> 
> >> 
> >> 
> > Le 20/12/2019 à 08:32, Yibo Cai a écrit :> 
> > > Hi,> 
> > >> 
> > > I'm investigating SIMD support to C++ compute kernel(not gandiva).> 
> > >> 
> > > A typical case is the sum kernel[1]. Below tight loop can be easily> 
> > optimized with SIMD.> 
> > >> 
> > > for (int64_t i = 0; i < length; i++) {> 
> > >local.sum += values[i];> 
> > > }> 
> > >> 
> > > Compiler already does loop vectorization. But it's done at compile time> 
> > without knowledge of target cpu.> 
> > > Binaries compiled with avx-512 cannot run on old cpu, while binaries> 
> > compiled with only sse4 enabled is suboptimal on new hardware.> 
> > >> 
> > > I have some proposals, would like to hear comments from community.> 
> > >> 
> > > - Based on our experience of ISA-L[2] project(optimized storage> 
> > acceleration library for x86 and Arm), runtime dispatcher is a good> 
> > approach. Basically, it links in codes optimized for different cpu> 
> > features(sse4,avx2,neon,...) and selects the best one fits target cpu at> 
> > first invocation. This is similar to gcc indirect function[3], but doesn't> 
> > depend on compilers.> 
> > >> 
> > > - Use gcc FMV [4] to generate multiple binaries for one function. See> 
> > sample source and compiled code [5].> 
> > >Though looks simple, it has many limitations: It's gcc specific> 
> > feature, no support from clang and msvc. It only works on x86, no Arm> 
> > support.> 
> > >I think this approach is no-go.> 
> > >> 
> > > - Don't do it.> 
> > >Gandiva leverages LLVM JIT for runtime code optimization. Is it> 
> > duplicated effort to do it in C++ kernel? Will these vectorizable> 
> > computations move to Gandiva in the future?> 
> > >> 
> > > [1]> 
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106> 
> > > [2] https://github.com/intel/isa-l> 
> > > [3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/> 
> > > [4] https://lwn.net/Articles/691932/> 
> > > [5] https://godbolt.org/z/ajpuq_> 
> > >> 
> >> 



[jira] [Created] (ARROW-8129) [C++][Compute] Refine compare sorting kernel

2020-03-16 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8129:
---

 Summary: [C++][Compute] Refine compare sorting kernel
 Key: ARROW-8129
 URL: https://issues.apache.org/jira/browse/ARROW-8129
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Yibo Cai
Assignee: Yibo Cai


The sorting kernel implements two comparison functions: 
[CompareValues|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L67]
 uses array.Value() for numeric data and 
[CompareViews|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L72]
 uses array.GetView() for non-numeric ones. This can be simplified by using 
GetView() only, as all data types support GetView().

To my surprise, the benchmark shows about a 40% performance improvement after the 
change.

After some digging, I found that in the current code the [comparison 
callback|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L94]
 is not inlined (check the disassembled code), so it leads to a function call. 
That's very bad for this hot loop. Using only GetView() fixes the issue; the code 
inlines okay.
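A minimal sketch of the inlining effect (illustration only, not the kernel code): 
a lambda has a unique type that carries its body into the sort instantiation, 
while a comparator erased to a function pointer typically forces an indirect call 
per comparison.
{code:cpp}
#include <algorithm>
#include <cstdint>
#include <vector>

bool CompareFn(int32_t lhs, int32_t rhs) { return lhs < rhs; }

void SortBoth(std::vector<int32_t>& values) {
  // Typically not inlined: comparator passed as a function pointer.
  std::stable_sort(values.begin(), values.end(), &CompareFn);
  // Inlined: the lambda's distinct type lets the compiler see the body.
  std::stable_sort(values.begin(), values.end(),
                   [](int32_t lhs, int32_t rhs) { return lhs < rhs; });
}
{code}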



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8126) [C++][Compute] Add Top-K kernel benchmark

2020-03-15 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8126:
---

 Summary: [C++][Compute] Add Top-K kernel benchmark
 Key: ARROW-8126
 URL: https://issues.apache.org/jira/browse/ARROW-8126
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Reporter: Yibo Cai
Assignee: Yibo Cai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7587) [C++][Compute] Add Top-k kernel

2020-01-15 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-7587:
---

 Summary: [C++][Compute] Add Top-k kernel
 Key: ARROW-7587
 URL: https://issues.apache.org/jira/browse/ARROW-7587
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Compute
Reporter: Yibo Cai
Assignee: Yibo Cai


Add a kernel to get the top k smallest or largest elements (indices).
std::partial_sort should be a better solution than sorting everything and then 
picking the top k.
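A minimal sketch of the idea (not the kernel implementation), returning indices of 
the k smallest values; it assumes 0 <= k <= values.size():
{code:cpp}
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

std::vector<int64_t> TopKIndices(const std::vector<int32_t>& values, int64_t k) {
  std::vector<int64_t> indices(values.size());
  std::iota(indices.begin(), indices.end(), 0);
  // Only the first k positions end up sorted; the rest stay unordered.
  std::partial_sort(indices.begin(), indices.begin() + k, indices.end(),
                    [&](int64_t l, int64_t r) { return values[l] < values[r]; });
  indices.resize(k);
  return indices;
}
{code}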



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7557) [C++][Compute] Validate sorting stability in random test

2020-01-12 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-7557:
---

 Summary: [C++][Compute] Validate sorting stability in random test
 Key: ARROW-7557
 URL: https://issues.apache.org/jira/browse/ARROW-7557
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Reporter: Yibo Cai
Assignee: Yibo Cai


The sorting kernel unit test doesn't validate sorting stability in the random test. [1]
It should assert "lhs < rhs" when "array.Value(lhs) == array.Value(rhs)".

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices_test.cc#L112-L121
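A minimal sketch of the missing assertion (illustration, not the actual test code):
{code:cpp}
#include <cassert>
#include <cstdint>
#include <vector>

// After a stable sort-to-indices, equal values must keep their original order.
void ValidateStability(const std::vector<int32_t>& values,
                       const std::vector<int64_t>& sorted_indices) {
  for (size_t i = 1; i < sorted_indices.size(); ++i) {
    const int64_t lhs = sorted_indices[i - 1];
    const int64_t rhs = sorted_indices[i];
    if (values[lhs] == values[rhs]) {
      assert(lhs < rhs);
    }
  }
}
{code}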



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7526) [C++][Compute]: Optimize small integer sorting

2020-01-09 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-7526:
---

 Summary: [C++][Compute]: Optimize small integer sorting
 Key: ARROW-7526
 URL: https://issues.apache.org/jira/browse/ARROW-7526
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Reporter: Yibo Cai
Assignee: Yibo Cai


The current sorting kernel handles all data types with STL stable_sort. That is 
suboptimal for small integers like Int8, for which counting sort is more 
suitable.
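A minimal counting-sort sketch for int8 (nulls ignored, illustration only): one 
pass builds a 256-entry histogram, one pass emits values in order, so the hot loop 
has no comparisons and the whole sort is O(n).
{code:cpp}
#include <array>
#include <cstdint>
#include <vector>

std::vector<int8_t> CountingSort(const std::vector<int8_t>& values) {
  // Map [-128, 127] to [0, 255] so bucket order matches value order.
  std::array<int64_t, 256> counts{};
  for (int8_t v : values) ++counts[static_cast<uint8_t>(v) ^ 0x80];
  std::vector<int8_t> out;
  out.reserve(values.size());
  for (int i = 0; i < 256; ++i) {
    const int8_t v = static_cast<int8_t>(static_cast<uint8_t>(i) ^ 0x80);
    out.insert(out.end(), counts[i], v);
  }
  return out;
}
{code}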



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[C++]: cmake: about parallel build of third party modules

2020-01-01 Thread Yibo Cai

I noticed a fresh build always gets stuck compiling protobuf for a long time. 
We've decided to use single-job building for each third party module [1], 
partly because different third party modules are built concurrently (protobuf 
is built concurrently with jemalloc, but protobuf itself is built with only one 
job).

The problem is that protobuf takes much more time than other modules to finish, which 
blocks the whole build. I tried adding "-j4" manually when compiling protobuf [2]; it 
significantly reduced the time for a fresh build. Below is my testing.

test setting

cpu: Intel E5-2650, 20 logical cores
memory: 64G

test command

cmake -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=ON -DARROW_COMPUTE=ON 
-DARROW_GANDIVA=ON ..
make -j8

test result
---
build protobuf with single job(default): 10min 32sec
build protobuf with four jobs(add -j4):  6min  23sec

Build time dropped 40%, from 632s to 383s. An even bigger gap is observed on the Arm 
platform.

I would suggest enabling multi-job builds for protobuf, maybe set to half the total 
jobs. Say, if we launch the arrow build with "make -j8", we compile protobuf with 
"-j4" (a rough sketch follows the links below). The code may be kind of ugly and 
inconsistent, but it deserves the effort IMHO. Comments?

[1] https://github.com/apache/arrow/pull/2779
[2] 
https://github.com/apache/arrow/blob/master/cpp/cmake_modules/ThirdpartyToolchain.cmake#L1163
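A rough sketch of the idea, assuming a Make-based ExternalProject (not the exact 
ThirdpartyToolchain code; the URL, version and "-j4" value are placeholders):

include(ExternalProject)

# Build protobuf with several jobs while other external projects keep the
# default single job; the job count would be derived from the top-level build.
ExternalProject_Add(protobuf_ep
  URL "https://github.com/protocolbuffers/protobuf/releases/download/v3.7.1/protobuf-all-3.7.1.tar.gz"
  BUILD_IN_SOURCE 1
  CONFIGURE_COMMAND ./configure --prefix=<INSTALL_DIR>
  BUILD_COMMAND make -j4
  INSTALL_COMMAND make install)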


[jira] [Created] (ARROW-7464) [C++][util]: Refine CpuInfo singleton with std::call_once

2019-12-22 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-7464:
---

 Summary: [C++][util]: Refine CpuInfo singleton with std::call_once
 Key: ARROW-7464
 URL: https://issues.apache.org/jira/browse/ARROW-7464
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


The CpuInfo singleton is created and initialized on the first invocation of
[CpuInfo::GetInstance()|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/cpu_info.cc#L188-L195].
 All calls afterwards return a reference to the
same instance. The current code uses std::mutex to make sure that CpuInfo
is created only once, but it introduces unnecessary overhead for later
calls. Concurrent threads getting the already-created instance should not block
each other.

Replace std::mutex with std::call_once to fix this issue.

References:
[1] https://en.cppreference.com/w/cpp/thread/call_once
[2] http://www.modernescpp.com/index.php/thread-safe-initialization-of-data
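A minimal sketch of the proposed change (not the actual CpuInfo code):
{code:cpp}
#include <mutex>

class CpuInfo {
 public:
  static CpuInfo* GetInstance() {
    static std::once_flag once;
    // Initialization runs exactly once; completed calls are cheap and
    // concurrent readers don't block each other on a mutex.
    std::call_once(once, []() {
      instance_ = new CpuInfo();
      instance_->Init();
    });
    return instance_;
  }

 private:
  CpuInfo() = default;
  void Init() { /* probe cpu features, cache sizes, ... */ }

  static CpuInfo* instance_;
};

CpuInfo* CpuInfo::instance_ = nullptr;
{code}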



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[C++][Compute] RFC: add SIMD support to C++ kernel

2019-12-19 Thread Yibo Cai

Hi,

I'm investigating SIMD support to C++ compute kernel(not gandiva).

A typical case is the sum kernel[1]. Below tight loop can be easily optimized 
with SIMD.

for (int64_t i = 0; i < length; i++) {
  local.sum += values[i];
}

Compiler already does loop vectorization. But it's done at compile time without 
knowledge of target cpu.
Binaries compiled with avx-512 cannot run on old cpu, while binaries compiled 
with only sse4 enabled is suboptimal on new hardware.

I have some proposals, would like to hear comments from community.

- Based on our experience with the ISA-L[2] project (an optimized storage acceleration 
library for x86 and Arm), a runtime dispatcher is a good approach. Basically, it 
links in code optimized for different cpu features(sse4,avx2,neon,...) and 
selects the best one that fits the target cpu at first invocation. This is similar to 
a gcc indirect function[3], but doesn't depend on compilers. A rough sketch is at the 
end of this mail.

- Use gcc FMV [4] to generate multiple binaries for one function. See sample 
source and compiled code [5].
  Though it looks simple, it has many limitations: it's a gcc-specific feature, with 
no support from clang and msvc, and it only works on x86, with no Arm support.
  I think this approach is a no-go.

- Don't do it.
  Gandiva leverages LLVM JIT for runtime code optimization. Is it duplicated 
effort to do it in the C++ kernels? Will these vectorizable computations move to 
Gandiva in the future?

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
[2] https://github.com/intel/isa-l
[3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
[4] https://lwn.net/Articles/691932/
[5] https://godbolt.org/z/ajpuq_
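A rough sketch of the dispatcher idea (names are illustrative; the feature probe 
and the AVX2 variant are stand-ins, and a real implementation would rebind the 
pointer atomically):

#include <cstdint>

namespace {

int64_t SumBase(const int64_t* v, int64_t n) {   // portable fallback
  int64_t s = 0;
  for (int64_t i = 0; i < n; ++i) s += v[i];
  return s;
}

int64_t SumAvx2(const int64_t* v, int64_t n) {   // would be built with -mavx2
  return SumBase(v, n);
}

bool CpuHasAvx2() { return false; }              // stand-in for a cpuid probe

int64_t SumResolve(const int64_t* v, int64_t n);
int64_t (*sum_ptr)(const int64_t*, int64_t) = SumResolve;

// First call: probe the cpu, rebind the pointer, then run the chosen variant.
int64_t SumResolve(const int64_t* v, int64_t n) {
  sum_ptr = CpuHasAvx2() ? SumAvx2 : SumBase;
  return sum_ptr(v, n);
}

}  // namespace

int64_t Sum(const int64_t* v, int64_t n) { return sum_ptr(v, n); }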


[jira] [Created] (ARROW-7404) [C++][Gandiva] Fix utf8 char length error on Arm64

2019-12-16 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-7404:
---

 Summary: [C++][Gandiva] Fix utf8 char length error on Arm64
 Key: ARROW-7404
 URL: https://issues.apache.org/jira/browse/ARROW-7404
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Yibo Cai
Assignee: Yibo Cai


Current code checks if a UTF-8 eight-bit code unit is within 0x00~0x7F
by "if (c >= 0)", where c is defined as "char". This check assumes
char is always signed, which is not true[1]. On Arm64, char is unsigned
by default, which causes some Gandiva unit tests to fail.

Fix it by casting to "signed char" explicitly.

[1] Cited from https://en.cppreference.com/w/cpp/language/types
The signedness of char depends on the compiler and the target platform:
the defaults for ARM and PowerPC are typically unsigned, the defaults
for x86 and x64 are typically signed.
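A minimal sketch of the fix:
{code:cpp}
// `char` may be unsigned (e.g. on Arm64), so cast explicitly: a UTF-8
// single-byte code unit is in 0x00~0x7F, i.e. non-negative as signed char.
static inline bool IsOneByteUtf8(char c) {
  return static_cast<signed char>(c) >= 0;
}
{code}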



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7403) [C++][JSON] Enable Rapidjson on Arm64 Neon

2019-12-16 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-7403:
---

 Summary: [C++][JSON] Enable Rapidjson on Arm64 Neon
 Key: ARROW-7403
 URL: https://issues.apache.org/jira/browse/ARROW-7403
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


RapidJSON supports Arm64 Neon, but it's not enabled in Arrow now. We need to 
define the macro RAPIDJSON_NEON to build RapidJSON with Neon support.
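For illustration, the macro just needs to be visible before the rapidjson headers 
(in Arrow it would be added as a compile definition in cmake):
{code:cpp}
#if defined(__aarch64__) && !defined(RAPIDJSON_NEON)
#define RAPIDJSON_NEON  // enable rapidjson's NEON-accelerated code paths
#endif
#include "rapidjson/reader.h"
{code}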



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7397) [C++] Json white space length detection error

2019-12-16 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-7397:
---

 Summary: [C++] Json white space length detection error
 Key: ARROW-7397
 URL: https://issues.apache.org/jira/browse/ARROW-7397
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yibo Cai


Commit 21ca13a5cd [1] introduces a bug in the ConsumeWhitespace() function.
When all chars in a string are whitespace, it should return the string
length, but the current code returns 0. It's not noticed on x86, which goes
down the rapidjson simd code path and is okay. Arm64 now goes down the buggy
code path and triggers a json unit test failure.

 [1] 
https://github.com/apache/arrow/commit/21ca13a5cd9c1478d64370732fcfae72d52350dd#diff-664e724274fbe0ff1e03745aa452b4d6R48
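A minimal scalar sketch of the intended behaviour (illustration only, not the 
Arrow code):
{code:cpp}
#include <cstdint>

// Count leading JSON whitespace; when the whole string is whitespace this
// must return `length`, not 0.
static int64_t ConsumeWhitespaceScalar(const char* data, int64_t length) {
  int64_t i = 0;
  while (i < length && (data[i] == ' ' || data[i] == '\t' ||
                        data[i] == '\n' || data[i] == '\r')) {
    ++i;
  }
  return i;
}
{code}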




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Gandiva] How to optimize per CPU feature

2019-12-15 Thread Yibo Cai

On 12/13/19 7:45 PM, Ravindra Pindikura wrote:

On Fri, Dec 13, 2019 at 3:41 PM Yibo Cai  wrote:


Hi,

Thanks to pravindra's patch [1], Gandiva loop vectorization is okay now.

Will Gandiva detect CPU features at runtime? My test CPU supports sse to avx2,
but I only see "target-features"="+fxsr,+mmx,+sse,+sse2,+x87" in the IR, and the
final code doesn't leverage registers longer than 128 bits.



Can you please give some details about the hardware/OS-version you are
running this on ? Also, are you building the binaries and running them on
the same host ?



I'm building and running on the same host.

Build: cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DARROW_BUILD_TESTS=ON 
-DARROW_GANDIVA=ON ..

OS: ubuntu 18.04

CPU: lscpu outputs below

Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):   1
NUMA node(s):1
Vendor ID:   GenuineIntel
CPU family:  6
Model:   60
Model name:  Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Stepping:3
CPU MHz: 3591.845
CPU max MHz: 4000.
CPU min MHz: 800.
BogoMIPS:7183.72
Virtualization:  VT-x
L1d cache:   32K
L1i cache:   32K
L2 cache:256K
L3 cache:8192K
NUMA node0 CPU(s):   0-7
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb 
invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid 
fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat 
pln pts md_clear flush_l1d




[1] https://github.com/apache/arrow/pull/6019





