Re: CI feedback time

2021-04-15 Thread Jorge Cardoso Leitão
Hi,

I agree.

> I'll submit two requirements though:
> - the configuration for CI builds must be kept in the Arrow repository
>(as they are currently in .github, etc.)
> - CI builds must be runnable from PRs
>

I'll submit three more:
- The result of the build (pass / did not pass) must be shown on github's
PRs
- The logs must be public and "clickable" from github
- We must not allow privileged arbitrary code execution from arbitrary users

I POCed Buildkite in January for Rust builds. See ARROW-11140 and the
corresponding PR https://github.com/apache/arrow/pull/9111. It
fulfilled the above requirements for docker runs.

The runner ran rootless docker for all PRs and branches, and people could
register runners on their own repos if they wished.

Limitations:
1. no macos and windows (no easy way to secure the runner against arbitrary
execution)
2. jobs cannot use sudo and privileged stuff (we would need a separate
queue for these, or e.g. use a user whitelist like Krisztián mentioned)

Best,
Jorge


On Thu, Apr 15, 2021 at 12:28 AM Antoine Pitrou  wrote:

>
> Hi Krisztian,
>
> Thanks for bringing this up.  This is definitely becoming a
> high-priority topic for Arrow development.
>
> I don't believe there is much opportunity for reducing the number of
> builds or their runtime.  We simply have a lot of development going on,
> and the number of different CI jobs we have is simply because we need to
> support many different configurations (and past experience has shown
> that they quickly stop working if we don't monitor them on a regular
> basis).
>
> So I think the only path forward is to build up (== buy, probably) our
> own execution resources for CI.  Whether that entails using Github
> self-hosted runners, Buildkite, or yet another system, I have no idea.
>
> I'll submit two requirements though:
> - the configuration for CI builds must be kept in the Arrow repository
>(as they are currently in .github, etc.)
> - CI builds must be runnable from PRs
>
> Regards
>
> Antoine.
>
>
> Le 15/04/2021 à 00:14, Krisztián Szűcs a écrit :
> > Hi,
> >
> > The Apache Github Actions agent pool seems to be oversubscribed as
> > more Apache projects migrate their CI setup to GHA. We experienced
> > pretty solid feedback times (~20-30m) when we originally moved to GHA
> > but now we are roughly 5hrs behind [1].
> >
> > Based on other projects' complaints and discussions [2][3] (I don't
> > have all the links at hand) we can't expect a short term solution from
> > infra. I think we *need* to figure out something on the project level
> > instead to maintain the overall project health and to improve the
> > development velocity.
> >
> > I don't have a concrete proposal at the moment, but we should start to
> > collect the available options. Ideas?
> >
> > Thanks, Krisztian
> >
> > [1]: https://github.com/apache/arrow/actions?query=is%3Ain_progress
> > [2]: https://github.com/apache/pulsar/issues/9154
> > [3]: https://issues.apache.org/jira/browse/SPARK-34053
> >
>


[VOTE] Release Apache Arrow 4.0.0 - RC0

2021-04-15 Thread Krisztián Szűcs
Hi,

I would like to propose the following release candidate (RC0) of Apache
Arrow version 4.0.0. This is a release consisting of 671
resolved JIRA issues[1].

This release candidate is based on commit:
3df78d3a98f346ed09667edc5ab551cfeff50b7a [2]

The source release rc0 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7].
The changelog is located at [8].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [9] for how to validate a release candidate.
Please use the release-4.0.0 branch for validation [10].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 4.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 4.0.0 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%204.0.0
[2]: 
https://github.com/apache/arrow/tree/3df78d3a98f346ed09667edc5ab551cfeff50b7a
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-4.0.0-rc0
[4]: https://bintray.com/apache/arrow/centos-rc/4.0.0-rc0
[5]: https://bintray.com/apache/arrow/debian-rc/4.0.0-rc0
[6]: https://bintray.com/apache/arrow/python-rc/4.0.0-rc0
[7]: https://bintray.com/apache/arrow/ubuntu-rc/4.0.0-rc0
[8]: 
https://github.com/apache/arrow/blob/3df78d3a98f346ed09667edc5ab551cfeff50b7a/CHANGELOG.md
[9]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[10]: https://github.com/apache/arrow/tree/release-4.0.0


Re: CI feedback time

2021-04-15 Thread Krisztián Szűcs
On Fri, Apr 16, 2021 at 1:11 AM Jed Brown  wrote:
>
> Wes McKinney  writes:
>
> > I think we should take a more serious look at Buildkite for some of our CI.
> >
> > * First of all, it's very easy to connect self-hosted workers and
> > supports ephemeral cloud workers in a way that would be difficult or
> > impossible with GHA. No need to have Infra fiddle with the admin
> > dashboard. So we could spin up extra workers during peak hours, or use
> > autoscaling to respond to demand.
> >
> > * We can set up more complex / dependent job pipelines rather than the
> > current GHA monolithic "long list of independent jobs" setup. For
> > example, we could have a fast gatekeeper job for C++ builds (which
> > lints and makes sure that everything compiles) that must pass before
> > more exhaustive longer-running jobs run.
>
> I don't have experience with Buildkite, but note that gitlab-runner is also 
> lightweight and well-featured as above. Here's an example with gatekeeping 
> stages across about 60 environments (mostly on-prem at multiple sites), 
> including explicit "pause-for-approval" to avoid unnecessary time-consuming 
> jobs.
>
> https://gitlab.com/petsc/petsc/-/pipelines/286655535
>
> We also use it for on-prem GPU-equipped CI with repositories hosted on 
> GitHub, reporting status to PRs. The Kubernetes and docker-machine executors 
> are intended for autoscaling.
>
> https://docs.gitlab.com/runner/executors/README.html

The CI technology/service we choose is just one piece of the puzzle.
We need to figure out a sustainable way of funding the agents/runners.

Sadly we don't have many CIs with free offerings for OSS left to try
(and allowed by INFRA).


Re: CI feedback time

2021-04-15 Thread Jed Brown
Wes McKinney  writes:

> I think we should take a more serious look at Buildkite for some of our CI.
>
> * First of all, it's very easy to connect self-hosted workers and
> supports ephemeral cloud workers in a way that would be difficult or
> impossible with GHA. No need to have Infra fiddle with the admin
> dashboard. So we could spin up extra workers during peak hours, or use
> autoscaling to respond to demand.
>
> * We can set up more complex / dependent job pipelines rather than the
> current GHA monolithic "long list of independent jobs" setup. For
> example, we could have a fast gatekeeper job for C++ builds (which
> lints and makes sure that everything compiles) that must pass before
> more exhaustive longer-running jobs run.

I don't have experience with Buildkite, but note that gitlab-runner is also 
lightweight and well-featured as above. Here's an example with gatekeeping 
stages across about 60 environments (mostly on-prem at multiple sites), 
including explicit "pause-for-approval" to avoid unnecessary time-consuming 
jobs.

https://gitlab.com/petsc/petsc/-/pipelines/286655535

We also use it for on-prem GPU-equipped CI with repositories hosted on GitHub, 
reporting status to PRs. The Kubernetes and docker-machine executors are 
intended for autoscaling.

https://docs.gitlab.com/runner/executors/README.html


Re: CI feedback time

2021-04-15 Thread Krisztián Szűcs
On Thu, Apr 15, 2021 at 11:53 PM Andy Grove  wrote:
>
> I started looking at BuildKite and it would solve one large problem for the
> DataFusion/Ballista project. We really need to be running integration tests
> against large data sets (such as TPC-H @ SF=100GB) and self-hosted
> BuildKite makes this simple to accomplish. I even have some modest hardware
> that I purchased specifically for this purpose, but I wasn't confident that
> I could set this up in a secure way that would protect against malicious
> code being submitted. However, if we implement the necessary GitHub hooks
We don't need additional hooks for this particular use case; see the
explanation below.
However, INFRA needs to configure hooks for each repository we want to
get commit events from.
For apache/arrow we have already hooked up a buildkite instance at
[3]; the same should be done for the new repositories as well.

> so that these builds only run after a committer adds an "ok to build"
> comment then I think it would be fine. This is the approach used in Apache
> Spark.
The build needs to query the pull request data from the github API
(since the event payload is not available by default on BK). There is
a field called author association [2] which contains the necessary
information to decide whether a pull request's author is trustworthy.
We already use the same mechanism [1] to handle the comment bot
(@github-actions) requests. Therefore we don't need to explicitly mark
a PR as "ok to build" sparing a manual step.

[1]: https://github.com/apache/arrow/blob/master/dev/archery/archery/bot.py#L98
[2]: https://docs.github.com/en/graphql/reference/enums#commentauthorassociation
[3]: https://buildkite.com/apache-arrow
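
A minimal sketch of the author-association check described above, in Python. It assumes the pull request data has already been fetched from the GitHub REST API as a dict and that it carries the `author_association` field. The function name and the set of trusted values are hypothetical; which associations count as trustworthy is a policy decision for the project, not something settled in this thread.

```python
# Hypothetical policy: associations we would treat as trusted.
# The actual set is a project decision and is only an assumption here.
TRUSTED_ASSOCIATIONS = frozenset({"OWNER", "MEMBER", "COLLABORATOR"})


def is_trusted_author(pull_request: dict, trusted=TRUSTED_ASSOCIATIONS) -> bool:
    """Return True if the PR's author_association is in the trusted set.

    `pull_request` is the JSON object for a pull request, already decoded
    into a dict. A missing field is treated as "NONE", i.e. untrusted.
    """
    association = pull_request.get("author_association", "NONE")
    return association in trusted


if __name__ == "__main__":
    print(is_trusted_author({"author_association": "MEMBER"}))
    print(is_trusted_author({"author_association": "FIRST_TIME_CONTRIBUTOR"}))
```

A build step could call something like this first and exit early for untrusted authors, which would remove the need for a manual "ok to build" comment.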
>
> On Thu, Apr 15, 2021 at 3:45 PM Wes McKinney  wrote:
>
> > I think we should take a more serious look at Buildkite for some of our CI.
> >
> > * First of all, it's very easy to connect self-hosted workers and
> > supports ephemeral cloud workers in a way that would be difficult or
> > impossible with GHA. No need to have Infra fiddle with the admin
> > dashboard. So we could spin up extra workers during peak hours, or use
> > autoscaling to respond to demand.
> >
> > * We can set up more complex / dependent job pipelines rather than the
> > current GHA monolithic "long list of independent jobs" setup. For
> > example, we could have a fast gatekeeper job for C++ builds (which
> > lints and makes sure that everything compiles) that must pass before
> > more exhaustive longer-running jobs run.
> >
> > On Thu, Apr 15, 2021 at 6:19 AM Krisztián Szűcs
> >  wrote:
> > >
> > > On Thu, Apr 15, 2021 at 2:13 AM Weston Pace 
> > wrote:
> > > >
> > > > It may be worth reaching out to the Airflow project.  Based on
> > > >
> > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> > > > it seems they have been investing time into figuring how to make
> > > > self-hosted runners work (it seems Github's patching model makes this
> > > > somewhat difficult).
> > >
> > > We tried to use github actions self hosted runners previously. Even
> > > though Airflow manages to harden the security issues of the self
> > > > hosted runners (which actually affect all hosted agent based CIs like
> > > > buildkite as well) registering and managing github agents requires
> > > > admin privileges on the repository, which we don't have.
> > > > In order to register a github self hosted runner we need to exchange
> > > > registration tokens with the Apache INFRA team per agent instance.
> > > Further issues:
> > > - a registration token expires in an hour
> > > - troubleshooting the agent<->github communication is not possible
> > > without involving additional INFRA roundtrips.
> > >
> > > >
> > > > On Wed, Apr 14, 2021 at 12:28 PM Antoine Pitrou 
> > wrote:
> > > > >
> > > > >
> > > > > Hi Krisztian,
> > > > >
> > > > > Thanks for bringing this up.  This is definitely becoming a
> > > > > high-priority topic for Arrow development.
> > > > >
> > > > > I don't believe there is much opportunity for reducing the number of
> > > > > builds or their runtime.  We simply have a lot of development going
> > on,
> > > > > and the number of different CI jobs we have is simply because we
> > need to
> > > > > support many different configurations (and past experience has shown
> > > > > that they quickly stop working if we don't monitor them on a regular
> > basis).
> > > > >
> > > > > So I think the only path forward is to build up (== buy, probably)
> > our
> > > > > own execution resources for CI.  Whether that entails using Github
> > > > > self-hosted runners, Buildkite, or yet another system, I have no
> > idea.
> > > > >
> > > > > I'll submit two requirements though:
> > > > > - the configuration for CI builds must be kept in the Arrow
> > repository
> > > > >(as they are currently in .github, etc.)
> > > > > - CI builds must be runnable from PRs
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > Le 15/04/2021 à 

Re: CI feedback time

2021-04-15 Thread Andy Grove
I started looking at BuildKite and it would solve one large problem for the
DataFusion/Ballista project. We really need to be running integration tests
against large data sets (such as TPC-H @ SF=100GB) and self-hosted
BuildKite makes this simple to accomplish. I even have some modest hardware
that I purchased specifically for this purpose, but I wasn't confident that
I could set this up in a secure way that would protect against malicious
code being submitted. However, if we implement the necessary GitHub hooks
so that these builds only run after a committer adds an "ok to build"
comment then I think it would be fine. This is the approach used in Apache
Spark.

On Thu, Apr 15, 2021 at 3:45 PM Wes McKinney  wrote:

> I think we should take a more serious look at Buildkite for some of our CI.
>
> * First of all, it's very easy to connect self-hosted workers and
> supports ephemeral cloud workers in a way that would be difficult or
> impossible with GHA. No need to have Infra fiddle with the admin
> dashboard. So we could spin up extra workers during peak hours, or use
> autoscaling to respond to demand.
>
> * We can set up more complex / dependent job pipelines rather than the
> current GHA monolithic "long list of independent jobs" setup. For
> example, we could have a fast gatekeeper job for C++ builds (which
> lints and makes sure that everything compiles) that must pass before
> more exhaustive longer-running jobs run.
>
> On Thu, Apr 15, 2021 at 6:19 AM Krisztián Szűcs
>  wrote:
> >
> > On Thu, Apr 15, 2021 at 2:13 AM Weston Pace 
> wrote:
> > >
> > > It may be worth reaching out to the Airflow project.  Based on
> > >
> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> > > it seems they have been investing time into figuring how to make
> > > self-hosted runners work (it seems Github's patching model makes this
> > > somewhat difficult).
> >
> > We tried to use github actions self hosted runners previously. Even
> > though Airflow manages to harden the security issues of the self
> > hosted runners (which actually affect all hosted agent based CIs like
> > buildkite as well) registering and managing github agents requires
> > admin privileges on the repository, which we don't have.
> > In order to register a github self hosted runner we need to exchange
> > registration tokens with the Apache INFRA team per agent instance.
> > Further issues:
> > - a registration token expires in an hour
> > - troubleshooting the agent<->github communication is not possible
> > without involving additional INFRA roundtrips.
> >
> > >
> > > On Wed, Apr 14, 2021 at 12:28 PM Antoine Pitrou 
> wrote:
> > > >
> > > >
> > > > Hi Krisztian,
> > > >
> > > > Thanks for bringing this up.  This is definitely becoming a
> > > > high-priority topic for Arrow development.
> > > >
> > > > I don't believe there is much opportunity for reducing the number of
> > > > builds or their runtime.  We simply have a lot of development going
> on,
> > > > and the number of different CI jobs we have is simply because we
> need to
> > > > support many different configurations (and past experience has shown
> > > > that they quickly stop working if we don't monitor them on a regular
> basis).
> > > >
> > > > So I think the only path forward is to build up (== buy, probably)
> our
> > > > own execution resources for CI.  Whether that entails using Github
> > > > self-hosted runners, Buildkite, or yet another system, I have no
> idea.
> > > >
> > > > I'll submit two requirements though:
> > > > - the configuration for CI builds must be kept in the Arrow
> repository
> > > >(as they are currently in .github, etc.)
> > > > - CI builds must be runnable from PRs
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 15/04/2021 à 00:14, Krisztián Szűcs a écrit :
> > > > > Hi,
> > > > >
> > > > > The Apache Github Actions agent pool seems to be oversubscribed as
> > > > > more Apache projects migrate their CI setup to GHA. We experienced
> > > > > pretty solid feedback times (~20-30m) when we originally moved to
> GHA
> > > > > but now we are roughly 5hrs behind [1].
> > > > >
> > > > > > Based on other projects' complaints and discussions [2][3] (I don't
> > > > > > have all the links at hand) we can't expect a short term solution
> from
> > > > > infra. I think we *need* to figure out something on the project
> level
> > > > > instead to maintain the overall project health and to improve the
> > > > > development velocity.
> > > > >
> > > > > I don't have a concrete proposal at the moment, but we should
> start to
> > > > > collect the available options. Ideas?
> > > > >
> > > > > Thanks, Krisztian
> > > > >
> > > > > [1]:
> https://github.com/apache/arrow/actions?query=is%3Ain_progress
> > > > > [2]: https://github.com/apache/pulsar/issues/9154
> > > > > [3]: https://issues.apache.org/jira/browse/SPARK-34053
> > > > >
>


Re: CI feedback time

2021-04-15 Thread Wes McKinney
I think we should take a more serious look at Buildkite for some of our CI.

* First of all, it's very easy to connect self-hosted workers and
supports ephemeral cloud workers in a way that would be difficult or
impossible with GHA. No need to have Infra fiddle with the admin
dashboard. So we could spin up extra workers during peak hours, or use
autoscaling to respond to demand.

* We can set up more complex / dependent job pipelines rather than the
current GHA monolithic "long list of independent jobs" setup. For
example, we could have a fast gatekeeper job for C++ builds (which
lints and makes sure that everything compiles) that must pass before
more exhaustive longer-running jobs run.
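
To make the gatekeeper idea concrete, here is a small Python sketch of the control flow only. It is not Buildkite configuration; the `run_gated` helper and the job names are hypothetical, and each job is modelled as a callable that returns True on success.

```python
from typing import Callable, Dict

Job = Callable[[], bool]  # a job returns True on success


def run_gated(gatekeeper: Job, expensive_jobs: Dict[str, Job]) -> Dict[str, str]:
    """Run the fast gatekeeper first; run the expensive jobs only if it passes."""
    gate_ok = gatekeeper()
    results = {"gatekeeper": "passed" if gate_ok else "failed"}
    for name, job in expensive_jobs.items():
        if not gate_ok:
            # Do not spend runner time when lint/compile already failed.
            results[name] = "skipped"
        else:
            results[name] = "passed" if job() else "failed"
    return results


if __name__ == "__main__":
    # Hypothetical job names, for illustration only.
    jobs = {"cpp-full-tests": lambda: True, "cpp-sanitizers": lambda: True}
    print(run_gated(lambda: False, jobs))
```

With a failing gatekeeper every expensive job is reported as skipped, which is where the saving in runner time would come from.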

On Thu, Apr 15, 2021 at 6:19 AM Krisztián Szűcs
 wrote:
>
> On Thu, Apr 15, 2021 at 2:13 AM Weston Pace  wrote:
> >
> > It may be worth reaching out to the Airflow project.  Based on
> > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> > it seems they have been investing time into figuring how to make
> > self-hosted runners work (it seems Github's patching model makes this
> > somewhat difficult).
>
> We tried to use github actions self hosted runners previously. Even
> though Airflow manages to harden the security issues of the self
> hosted runners (which actually affect all hosted agent based CIs like
> buildkite as well) registering and managing github agents requires
> admin privileges on the repository, which we don't have.
> In order to register a github self hosted runner we need to exchange
> registration tokens with the Apache INFRA team per agent instance.
> Further issues:
> - a registration token expires in an hour
> - troubleshooting the agent<->github communication is not possible
> without involving additional INFRA roundtrips.
>
> >
> > On Wed, Apr 14, 2021 at 12:28 PM Antoine Pitrou  wrote:
> > >
> > >
> > > Hi Krisztian,
> > >
> > > Thanks for bringing this up.  This is definitely becoming a
> > > high-priority topic for Arrow development.
> > >
> > > I don't believe there is much opportunity for reducing the number of
> > > builds or their runtime.  We simply have a lot of development going on,
> > > and the number of different CI jobs we have is simply because we need to
> > > support many different configurations (and past experience has shown
> > > that they quickly stop working if we don't monitor them on a regular 
> > > basis).
> > >
> > > So I think the only path forward is to build up (== buy, probably) our
> > > own execution resources for CI.  Whether that entails using Github
> > > self-hosted runners, Buildkite, or yet another system, I have no idea.
> > >
> > > I'll submit two requirements though:
> > > - the configuration for CI builds must be kept in the Arrow repository
> > >(as they are currently in .github, etc.)
> > > - CI builds must be runnable from PRs
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 15/04/2021 à 00:14, Krisztián Szűcs a écrit :
> > > > Hi,
> > > >
> > > > The Apache Github Actions agent pool seems to be oversubscribed as
> > > > more Apache projects migrate their CI setup to GHA. We experienced
> > > > pretty solid feedback times (~20-30m) when we originally moved to GHA
> > > > but now we are roughly 5hrs behind [1].
> > > >
> > > > Based on other projects' complaints and discussions [2][3] (I don't
> > > > have all the links at hand) we can't expect a short term solution from
> > > > infra. I think we *need* to figure out something on the project level
> > > > instead to maintain the overall project health and to improve the
> > > > development velocity.
> > > >
> > > > I don't have a concrete proposal at the moment, but we should start to
> > > > collect the available options. Ideas?
> > > >
> > > > Thanks, Krisztian
> > > >
> > > > [1]: https://github.com/apache/arrow/actions?query=is%3Ain_progress
> > > > [2]: https://github.com/apache/pulsar/issues/9154
> > > > [3]: https://issues.apache.org/jira/browse/SPARK-34053
> > > >


Re: [Java] Source control of generated flatbuffers code

2021-04-15 Thread Bob Tinsman
OK, I just approved those changes. I was working on a shell script to automate 
it--nice to have, but not necessary. Better that you can get it into 4.0. 
Thanks!

On 2021/04/15 17:33:20, Micah Kornfield  wrote: 
> I took a look and added comments.   I'm not sure if Bob replied off-list,
> so hopefully no work was duplicated.
> 
> Let's try to be mindful that the project is asynchronous in nature and it
> might take a little time to reply.
> 
> Cheers,
> Micah
> 
> On Thu, Apr 15, 2021 at 10:00 AM Nate Bauernfeind <
> natebauernfe...@deephaven.io> wrote:
> 
> > > I think checking in the java files is fine and probably better than
> > relying
> > > on a third party package.  We should make sure there are instructions on
> > > how to regenerate them along with the PR
> >
> > Micah,
> >
> > I just opened a pull-request to satisfy ARROW-12111. This is my first
> > contribution to an apache project; please let me know if there is anything
> > else that I need to do to get this past the finish line.
> >
> > https://github.com/apache/arrow/pull/10058
> >
> > Thanks,
> > Nate
> >
> > On Wed, Apr 14, 2021 at 11:45 PM Nate Bauernfeind <
> > natebauernfe...@deephaven.io> wrote:
> >
> > > Hey Bob,
> > >
> > > Someone did publish a 1.12 version of the flatc maven plugin. I double
> > > checked that the plugin and binaries look correct and legit.. but you
> > know,
> > > it's always shady to download some random executable from the internet
> > and
> > > run it. However, I have been using it to generate the arrow flatbuffer
> > > files because I _really_ wanted some features that are in flatc 1.12's
> > > runtime jar (there are performance improvements for array types in
> > > particular).
> > >
> > > You can see them here:
> > > https://search.maven.org/search?q=com.github.shinanca
> > > The repository fork is here: https://github.com/shinanca/flatc
> > >
> > > On the bright side that developer appears to have published an x86_64
> > > windows binary which might satisfy one of your earlier complaints in the
> > > thread.
> > >
> > > On the other hand, if everyone is comfortable checking in the flatc
> > > generated files (obviously with the additional documentation on how to
> > > re-generate should the fbs files change), it's a relatively small change
> > to
> > > replace the existing apache/arrow/java/format source. Based on the
> > previous
> > > discussion on this thread, it seems that the arrow dev team _is_
> > > comfortable with the check-in-the-generated-files approach.
> > >
> > > Although 4.0 is near the release phase, there are still a few blocking
> > > issues that people are trying to fix (according to the arrow-sync call
> > > earlier today). I don't mind jumping in and doing this; it appears that
> > > there might be enough time for such a small change to make it into the
> > > release if the work is performed and merged ASAP.
> > >
> > > I guess, I'm either looking for the "pull request is on the way" or the
> > > "thumbs up - we definitely want this; I'll get the code review for you
> > when
> > > it's ready" style reply =D.
> > >
> > > On Wed, Apr 14, 2021 at 10:43 PM Bob Tinsman  wrote:
> > >
> > >> I apologize for leaving this hanging, but it looks like 4.0 is leaving
> > >> the station :(
> > >> Yes, it makes sense to bump it to 1.12, but you can't do that in
> > >> isolation, because the flatc binary which is fetched as a Maven
> > dependency
> > >> is only available for 1.9. I will get back onto this and finish it this
> > >> week.
> > >>
> > >> FWIW, I was looking around and catalogued the various ways of generating
> > >> flatbuffers for all the languages--you can look at it in my branch:
> > >> https://github.com/bobtins/arrow/tree/check-in-gen-code/java/dev
> > >> Let me know if any info is wrong or missing.
> > >> The methods of generation are all over the map, and some have no script
> > >> or build file, just doc. Would there be any value in making this more
> > >> uniform?
> > >>
> > >> On 2021/04/14 16:36:47, Nate Bauernfeind 
> > >> wrote:
> > >> > It would also be nice to upgrade that java flatbuffer version from 1.9
> > >> to
> > >> > 1.12. Is anyone planning on doing this work (as listed in
> > ARROW-12111)?
> > >> >
> > >> > If I did this work today, might it be possible to get it included in
> > the
> > >> > 4.0.0 release?
> > >> >
> > >> > On Fri, Mar 26, 2021 at 3:25 PM bobtins  wrote:
> > >> >
> > >> > > OK, originally this was part of
> > >> > > https://issues.apache.org/jira/browse/ARROW-12006 and I was going
> > to
> > >> just
> > >> > > add some doc on flatc, but I will make this a new bug because it's a
> > >> little
> > >> > > bigger: https://issues.apache.org/jira/browse/ARROW-12111
> > >> > >
> > >> > > On 2021/03/23 23:40:50, Micah Kornfield 
> > >> wrote:
> > >> > > > >
> > >> > > > > I have a concern, though. Four other languages (Java would be
> > >> five)
> > >> > > check
> > >> > > > > in the generated flatbuffers code, and it appears (based on a
> > >> quick
> > 

[Rust][Datafusion] Timestamp Millisecond support

2021-04-15 Thread Evan Chan
Hi folks,

So currently Arrow Rust/DataFusion supports four types of Timestamp arrays, 
with Nano, Micro, Millisecond and Second resolution.  However, the best 
supported by far are Nanos.  For example, in DataFusion, the following only 
works for Nanos and not the other resolutions:
* CAST(x as TIMESTAMP)  -> Nanos only
* date_trunc()  -> nanos only
* filtering a timestamp array, e.g. my column > to_timestamp('2020-06-30T12:00Z')

In the broader SQL world, in general there seems to be only a single Timestamp 
type in most databases, though in many cases there is a variable resolution.  
PostgreSQL’s TIMESTAMP type is microsecond-based:
https://www.postgresql.org/docs/9.1/datatype-datetime.html 


For UrbanLogiq, in some cases we would like to standardize on millisecond
resolution.  Nanoseconds stored in an i64 cover only about 292 years either
side of the epoch, which isn’t enough for some applications.  At minimum,
we’d like to get the following working:
* Either something like CAST(x AS TIMESTAMP(Milliseconds)) or
date_trunc('milliseconds', x), which can cast different types of timestamp
arrays to Timestamp(Milliseconds, None)
* filtering that can compare, ideally, different types of timestamp columns to
a to_timestamp(...)
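
The range limit for nanoseconds mentioned above can be checked with a few lines of arithmetic. This is a plain Python sketch, independent of Arrow; the helper name is made up, and it assumes a signed 64-bit count of units since the epoch and a mean Gregorian year of 365.2425 days.

```python
I64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year (assumption)

UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def years_each_side_of_epoch(unit: str) -> float:
    """Years representable on either side of the epoch by an i64 count of `unit`."""
    return I64_MAX / UNITS_PER_SECOND[unit] / SECONDS_PER_YEAR


if __name__ == "__main__":
    for unit in ("ns", "us", "ms"):
        years = years_each_side_of_epoch(unit)
        print(f"{unit}: about {years:,.0f} years either side of the epoch")
```

For nanoseconds this comes out at roughly 292 years in each direction, while microseconds and milliseconds give a range that is 1,000 and 1,000,000 times larger respectively.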

The last problem is easy to solve as the coercion logic can be fixed to address 
that, but solving the first problem is not as straightforward.
- There isn’t any universal SQL standard for a type that supports different 
timestamp resolutions
- The most non-intrusive way I can think of is:
  - CAST(x AS TIMESTAMP)   -> Nanos
  - CAST(x AS TIMESTAMP(Micros/Millis/Seconds)) -> Micros/Millis/Seconds
    arrays
- Functions are designed/work best with a single output type, and date_trunc() 
is designed to output nanos only.  Fixing this would not be trivial, it would 
probably require changing the signature of return_type() so that return types 
can be determined from argument values, not just argument types

Basically the larger question for the Arrow/DataFusion community is how do we 
want to deal with supporting different timestamp types.  The ideal would be 
that different functions work on all the timestamp types, but it’ll take a long 
time to get there I fear.  Some possible directions:

- Continue current support for nanos as the “first class” citizen, but add
  support for casting to different timestamp resolutions, and coercion to nanos
  to work with different functions like date_trunc().
  - This means Arrow would not be usable in some cases.  Converting from
    micros to nanos loses year range (for 64 bits, anyway)
  - There is a performance penalty for the coercion
- Switch to a different basis for the “base” timestamp type, like PostgreSQL did
  with micros
- Use more than 64 bits to represent nanos
- Add support for functions to work with different timestamp types
  - For example, cast() produces different timestamp resolutions
  - Other date functions like date_trunc() can take input in different
    resolutions
  - Signatures and return type calculation would be more complex
- Switch to a universal timestamp type which supports different resolutions, as
  some SQL databases support
Thanks for your input,
Evan

Re: [Java] Source control of generated flatbuffers code

2021-04-15 Thread Micah Kornfield
I took a look and added comments.   I'm not sure if Bob replied off-list,
so hopefully no work was duplicated.

Let's try to be mindful that the project is asynchronous in nature and it
might take a little time to reply.

Cheers,
Micah

On Thu, Apr 15, 2021 at 10:00 AM Nate Bauernfeind <
natebauernfe...@deephaven.io> wrote:

> > I think checking in the java files is fine and probably better than
> relying
> > on a third party package.  We should make sure there are instructions on
> > how to regenerate them along with the PR
>
> Micah,
>
> I just opened a pull-request to satisfy ARROW-12111. This is my first
> contribution to an apache project; please let me know if there is anything
> else that I need to do to get this past the finish line.
>
> https://github.com/apache/arrow/pull/10058
>
> Thanks,
> Nate
>
> On Wed, Apr 14, 2021 at 11:45 PM Nate Bauernfeind <
> natebauernfe...@deephaven.io> wrote:
>
> > Hey Bob,
> >
> > Someone did publish a 1.12 version of the flatc maven plugin. I double
> > checked that the plugin and binaries look correct and legit.. but you
> know,
> > it's always shady to download some random executable from the internet
> and
> > run it. However, I have been using it to generate the arrow flatbuffer
> > files because I _really_ wanted some features that are in flatc 1.12's
> > runtime jar (there are performance improvements for array types in
> > particular).
> >
> > You can see them here:
> > https://search.maven.org/search?q=com.github.shinanca
> > The repository fork is here: https://github.com/shinanca/flatc
> >
> > On the bright side that developer appears to have published an x86_64
> > windows binary which might satisfy one of your earlier complaints in the
> > thread.
> >
> > On the other hand, if everyone is comfortable checking in the flatc
> > generated files (obviously with the additional documentation on how to
> > re-generate should the fbs files change), it's a relatively small change
> to
> > replace the existing apache/arrow/java/format source. Based on the
> previous
> > discussion on this thread, it seems that the arrow dev team _is_
> > comfortable with the check-in-the-generated-files approach.
> >
> > Although 4.0 is near the release phase, there are still a few blocking
> > issues that people are trying to fix (according to the arrow-sync call
> > earlier today). I don't mind jumping in and doing this; it appears that
> > there might be enough time for such a small change to make it into the
> > release if the work is performed and merged ASAP.
> >
> > I guess, I'm either looking for the "pull request is on the way" or the
> > "thumbs up - we definitely want this; I'll get the code review for you
> when
> > it's ready" style reply =D.
> >
> > On Wed, Apr 14, 2021 at 10:43 PM Bob Tinsman  wrote:
> >
> >> I apologize for leaving this hanging, but it looks like 4.0 is leaving
> >> the station :(
> >> Yes, it makes sense to bump it to 1.12, but you can't do that in
> >> isolation, because the flatc binary which is fetched as a Maven
> dependency
> >> is only available for 1.9. I will get back onto this and finish it this
> >> week.
> >>
> >> FWIW, I was looking around and catalogued the various ways of generating
> >> flatbuffers for all the languages--you can look at it in my branch:
> >> https://github.com/bobtins/arrow/tree/check-in-gen-code/java/dev
> >> Let me know if any info is wrong or missing.
> >> The methods of generation are all over the map, and some have no script
> >> or build file, just doc. Would there be any value in making this more
> >> uniform?
> >>
> >> On 2021/04/14 16:36:47, Nate Bauernfeind 
> >> wrote:
> >> > It would also be nice to upgrade that java flatbuffer version from 1.9
> >> to
> >> > 1.12. Is anyone planning on doing this work (as listed in
> ARROW-12111)?
> >> >
> >> > If I did this work today, might it be possible to get it included in
> the
> >> > 4.0.0 release?
> >> >
> >> > On Fri, Mar 26, 2021 at 3:25 PM bobtins  wrote:
> >> >
> >> > > OK, originally this was part of
> >> > > https://issues.apache.org/jira/browse/ARROW-12006 and I was going
> to
> >> just
> >> > > add some doc on flatc, but I will make this a new bug because it's a
> >> little
> >> > > bigger: https://issues.apache.org/jira/browse/ARROW-12111
> >> > >
> >> > > On 2021/03/23 23:40:50, Micah Kornfield 
> >> wrote:
> >> > > > >
> >> > > > > I have a concern, though. Four other languages (Java would be
> >> five)
> >> > > check
> >> > > > > in the generated flatbuffers code, and it appears (based on a
> >> quick
> >> > > scan of
> >> > > > > Git logs) that this is done manually. Is there a danger that the
> >> binary
> >> > > > > format could change, but some language might get forgotten, and
> >> thus be
> >> > > > > working with the old format?
> >> > > >
> >> > > > The format changes relatively slowly and any changes at this point
> >> should
> >> > > > be backwards compatible.
> >> > > >
> >> > > >
> >> > > >
> >> > > > > Or is there enough interop 

Re: [Java] Source control of generated flatbuffers code

2021-04-15 Thread Nate Bauernfeind
> I think checking in the java files is fine and probably better than
relying
> on a third party package.  We should make sure there are instructions on
> how to regenerate them along with the PR

Micah,

I just opened a pull-request to satisfy ARROW-12111. This is my first
contribution to an apache project; please let me know if there is anything
else that I need to do to get this past the finish line.

https://github.com/apache/arrow/pull/10058

Thanks,
Nate
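
For readers following the "instructions on how to regenerate them" point, here is a minimal sketch of what a regeneration step could look like. It only builds the flatc command line; it does not run it. The schema file names, the directory paths, and the helper name are assumptions for illustration and are not taken from the pull request. The only flatc options used are `--java` and `-o`.

```python
from pathlib import PurePosixPath
from typing import List, Sequence

# Assumption: schema files expected under the Arrow "format" directory.
# Adjust this list to match the actual checkout.
DEFAULT_SCHEMAS = (
    "Schema.fbs",
    "Message.fbs",
    "File.fbs",
    "Tensor.fbs",
    "SparseTensor.fbs",
)


def flatc_java_command(
    format_dir: str,
    out_dir: str,
    schemas: Sequence[str] = DEFAULT_SCHEMAS,
    flatc: str = "flatc",
) -> List[str]:
    """Build (but do not run) a flatc command that generates Java sources.

    format_dir: directory holding the .fbs files (hypothetical layout)
    out_dir:    directory that receives the generated .java files
    """
    fmt = PurePosixPath(format_dir)
    # flatc --java -o <out_dir> <schema files...>
    return [flatc, "--java", "-o", str(PurePosixPath(out_dir))] + [
        str(fmt / name) for name in schemas
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where a flatc 1.12 binary is on the PATH; keeping the command in one documented place is one way to make the checked-in files reproducible.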

On Wed, Apr 14, 2021 at 11:45 PM Nate Bauernfeind <
natebauernfe...@deephaven.io> wrote:

> Hey Bob,
>
> Someone did publish a 1.12 version of the flatc maven plugin. I double
> checked that the plugin and binaries look correct and legit.. but you know,
> it's always shady to download some random executable from the internet and
> run it. However, I have been using it to generate the arrow flatbuffer
> files because I _really_ wanted some features that are in flatc 1.12's
> runtime jar (there are performance improvements for array types in
> particular).
>
> You can see them here:
> https://search.maven.org/search?q=com.github.shinanca
> The repository fork is here: https://github.com/shinanca/flatc
>
> On the bright side that developer appears to have published an x86_64
> windows binary which might satisfy one of your earlier complaints in the
> thread.
>
> On the other hand, if everyone is comfortable checking in the flatc
> generated files (obviously with the additional documentation on how to
> re-generate should the fbs files change), it's a relatively small change to
> replace the existing apache/arrow/java/format source. Based on the previous
> discussion on this thread, it seems that the arrow dev team _is_
> comfortable with the check-in-the-generated-files approach.
>
> Although 4.0 is near the release phase, there are still a few blocking
> issues that people are trying to fix (according to the arrow-sync call
> earlier today). I don't mind jumping in and doing this; it appears that
> there might be enough time for such a small change to make it into the
> release if the work is performed and merged ASAP.
>
> I guess, I'm either looking for the "pull request is on the way" or the
> "thumbs up - we definitely want this; I'll get the code review for you when
> it's ready" style reply =D.
>
> On Wed, Apr 14, 2021 at 10:43 PM Bob Tinsman  wrote:
>
>> I apologize for leaving this hanging, but it looks like 4.0 is leaving
>> the station :(
>> Yes, it makes sense to bump it to 1.12, but you can't do that in
>> isolation, because the flatc binary which is fetched as a Maven dependency
>> is only available for 1.9. I will get back onto this and finish it this
>> week.
>>
>> FWIW, I was looking around and catalogued the various ways of generating
>> flatbuffers for all the languages--you can look at it in my branch:
>> https://github.com/bobtins/arrow/tree/check-in-gen-code/java/dev
>> Let me know if any info is wrong or missing.
>> The methods of generation are all over the map, and some have no script
>> or build file, just doc. Would there be any value in making this more
>> uniform?
>>
>> On 2021/04/14 16:36:47, Nate Bauernfeind 
>> wrote:
>> > It would also be nice to upgrade that java flatbuffer version from 1.9
>> to
>> > 1.12. Is anyone planning on doing this work (as listed in ARROW-12111)?
>> >
>> > If I did this work today, might it be possible to get it included in the
>> > 4.0.0 release?
>> >
>> > On Fri, Mar 26, 2021 at 3:25 PM bobtins  wrote:
>> >
>> > > OK, originally this was part of
>> > > https://issues.apache.org/jira/browse/ARROW-12006 and I was going to
>> just
>> > > add some doc on flatc, but I will make this a new bug because it's a
>> little
>> > > bigger: https://issues.apache.org/jira/browse/ARROW-12111
>> > >
>> > > On 2021/03/23 23:40:50, Micah Kornfield 
>> wrote:
>> > > > >
>> > > > > I have a concern, though. Four other languages (Java would be
>> five)
>> > > check
>> > > > > in the generated flatbuffers code, and it appears (based on a
>> quick
>> > > scan of
>> > > > > Git logs) that this is done manually. Is there a danger that the
>> binary
>> > > > > format could change, but some language might get forgotten, and
>> thus be
>> > > > > working with the old format?
>> > > >
>> > > > The format changes relatively slowly and any changes at this point
>> should
>> > > > be backwards compatible.
>> > > >
>> > > >
>> > > >
>> > > > > Or is there enough interop testing that the problem would get
>> caught
>> > > right
>> > > > > away?
>> > > >
>> > > > In most cases I would expect integration tests to catch these types
>> of
>> > > > error.
>> > > >
>> > > > On Tue, Mar 23, 2021 at 4:26 PM bobtins  wrote:
>> > > >
>> > > > > I'm happy to check in the generated Java source. I would also
>> update
>> > > the
>> > > > > Java build info to reflect this change and document how to
>> regenerate
>> > > the
>> > > > > source as needed.
>> > > > >
>> > > > > I have a concern, though. Four other languages (Java would be
>> 

Re: [VOTE] Move Rust components to new repos and process

2021-04-15 Thread paddy horan
+1



From: Joris Van den Bossche 
Sent: Thursday, April 15, 2021 10:07:27 AM
To: dev 
Subject: Re: [VOTE] Move Rust components to new repos and process

+1 (non-binding)

Joris

On Thu, 15 Apr 2021 at 15:42, Wes McKinney  wrote:

> +1 (binding)
>
> On Thu, Apr 15, 2021 at 7:31 AM Weston Steimel 
> wrote:
> >
> > +1
> >
> > On Thu, 15 Apr 2021 at 00:05, Andy Grove  wrote:
> >
> > > This vote is to determine if the Arrow PMC is in favor of the Rust
> > > community moving the Rust implementation of Apache Arrow as well as the
> > > related projects (such as Parquet, DataFusion, Ballista, etc) out of
> the
> > > monorepo and into two new repositories, as outlined in the proposal
> > > document [1].
> > >
> > > Please vote whether to accept the proposal and allow the Rust
> community to
> > > proceed with the work.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 : Accept the proposal
> > >
> > > [ ] 0 : No opinion
> > >
> > > [ ] -1 : Reject proposal because...
> > >
> > > Here is my vote: +1
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > [1]
> > >
> > >
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit?usp=sharing
> > >
>


Re: [VOTE] Move Rust components to new repos and process

2021-04-15 Thread Joris Van den Bossche
+1 (non-binding)

Joris

On Thu, 15 Apr 2021 at 15:42, Wes McKinney  wrote:

> +1 (binding)
>
> On Thu, Apr 15, 2021 at 7:31 AM Weston Steimel 
> wrote:
> >
> > +1
> >
> > On Thu, 15 Apr 2021 at 00:05, Andy Grove  wrote:
> >
> > > This vote is to determine if the Arrow PMC is in favor of the Rust
> > > community moving the Rust implementation of Apache Arrow as well as the
> > > related projects (such as Parquet, DataFusion, Ballista, etc) out of
> the
> > > monorepo and into two new repositories, as outlined in the proposal
> > > document [1].
> > >
> > > Please vote whether to accept the proposal and allow the Rust
> community to
> > > proceed with the work.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 : Accept the proposal
> > >
> > > [ ] 0 : No opinion
> > >
> > > [ ] -1 : Reject proposal because...
> > >
> > > Here is my vote: +1
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > [1]
> > >
> > >
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit?usp=sharing
> > >
>


Re: [VOTE] Move Rust components to new repos and process

2021-04-15 Thread Wes McKinney
+1 (binding)

On Thu, Apr 15, 2021 at 7:31 AM Weston Steimel  wrote:
>
> +1
>
> On Thu, 15 Apr 2021 at 00:05, Andy Grove  wrote:
>
> > This vote is to determine if the Arrow PMC is in favor of the Rust
> > community moving the Rust implementation of Apache Arrow as well as the
> > related projects (such as Parquet, DataFusion, Ballista, etc) out of the
> > monorepo and into two new repositories, as outlined in the proposal
> > document [1].
> >
> > Please vote whether to accept the proposal and allow the Rust community to
> > proceed with the work.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 : Accept the proposal
> >
> > [ ] 0 : No opinion
> >
> > [ ] -1 : Reject proposal because...
> >
> > Here is my vote: +1
> >
> > Thanks,
> >
> > Andy.
> >
> > [1]
> >
> > https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit?usp=sharing
> >


Re: [VOTE] Move Rust components to new repos and process

2021-04-15 Thread Weston Steimel
+1

On Thu, 15 Apr 2021 at 00:05, Andy Grove  wrote:

> This vote is to determine if the Arrow PMC is in favor of the Rust
> community moving the Rust implementation of Apache Arrow as well as the
> related projects (such as Parquet, DataFusion, Ballista, etc) out of the
> monorepo and into two new repositories, as outlined in the proposal
> document [1].
>
> Please vote whether to accept the proposal and allow the Rust community to
> proceed with the work.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 : Accept the proposal
>
> [ ] 0 : No opinion
>
> [ ] -1 : Reject proposal because...
>
> Here is my vote: +1
>
> Thanks,
>
> Andy.
>
> [1]
>
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit?usp=sharing
>


Re: [VOTE] Move Rust components to new repos and process

2021-04-15 Thread Antoine Pitrou



+0.

Regards

Antoine.


Le 15/04/2021 à 02:04, Andy Grove a écrit :

This vote is to determine if the Arrow PMC is in favor of the Rust
community moving the Rust implementation of Apache Arrow as well as the
related projects (such as Parquet, DataFusion, Ballista, etc) out of the
monorepo and into two new repositories, as outlined in the proposal
document [1].

Please vote whether to accept the proposal and allow the Rust community to
proceed with the work.

The vote will be open for at least 72 hours.

[ ] +1 : Accept the proposal

[ ] 0 : No opinion

[ ] -1 : Reject proposal because...

Here is my vote: +1

Thanks,

Andy.

[1]
https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit?usp=sharing



Re: CI feedback time

2021-04-15 Thread Krisztián Szűcs
On Thu, Apr 15, 2021 at 2:13 AM Weston Pace  wrote:
>
> It may be worth reaching out to the Airflow project.  Based on
> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> it seems they have been investing time into figuring out how to make
> self-hosted runners work (it seems Github's patching model makes this
> somewhat difficult).

We tried to use github actions self-hosted runners previously. Even
though Airflow manages to mitigate the security issues of self-hosted
runners (which actually affect all hosted-agent-based CIs, like
buildkite, as well), registering and managing github agents requires
admin privileges on the repository, which we don't have.
In order to register a github self-hosted runner we need to exchange
registration tokens with the Apache INFRA team for each agent instance.
Further issues:
- a registration token expires in an hour
- troubleshooting the agent<->github communication is not possible
without additional INFRA round trips.

>
> On Wed, Apr 14, 2021 at 12:28 PM Antoine Pitrou  wrote:
> >
> >
> > Hi Krisztian,
> >
> > Thanks for bringing this up.  This is definitely becoming a
> > high-priority topic for Arrow development.
> >
> > I don't believe there is much opportunity for reducing the number of
> > builds or their runtime.  We simply have a lot of development going on,
> > and the number of different CI jobs we have is simply because we need to
> > support many different configurations (and past experience has shown
> > that they quickly stop working if we don't monitor them on a regular basis).
> >
> > So I think the only path forward is to build up (== buy, probably) our
> > own execution resources for CI.  Whether that entails using Github
> > self-hosted runners, Buildkite, or yet another system, I have no idea.
> >
> > I'll submit two requirements though:
> > - the configuration for CI builds must be kept in the Arrow repository
> >(as they are currently in .github, etc.)
> > - CI builds must be runnable from PRs
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 15/04/2021 à 00:14, Krisztián Szűcs a écrit :
> > > Hi,
> > >
> > > The Apache Github Actions agent pool seems to be oversubscribed as
> > > more Apache projects migrate their CI setup to GHA. We experienced
> > > pretty solid feedback times (~20-30m) when we originally moved to GHA
> > > but now we are roughly 5hrs behind [1].
> > >
> > > > Based on other projects' complaints and discussions [2][3] (I don't
> > > > have all the links at hand) we can't expect a short term solution from
> > > infra. I think we *need* to figure out something on the project level
> > > instead to maintain the overall project health and to improve the
> > > development velocity.
> > >
> > > I don't have a concrete proposal at the moment, but we should start to
> > > collect the available options. Ideas?
> > >
> > > Thanks, Krisztian
> > >
> > > [1]: https://github.com/apache/arrow/actions?query=is%3Ain_progress
> > > [2]: https://github.com/apache/pulsar/issues/9154
> > > [3]: https://issues.apache.org/jira/browse/SPARK-34053
> > >


Re: CI feedback time

2021-04-15 Thread Krisztián Szűcs
On Thu, Apr 15, 2021 at 10:48 AM Antoine Pitrou  wrote:
>
>
> Le 15/04/2021 à 03:13, Kazuaki Ishizaki a écrit :
> > As we know this is a common issue among Apache projects. While the
> > projects do not have the final solution, the Apache Spark project has a
> > mechanism [1][2] to run tests in one's own local (forked) repository. Can we
> > alleviate the problem a little bit?
>
> Anyone can already enable AppVeyor, Travis-CI and Github Actions on
> their own fork.  There is no particular action to do here.

There is a slight but meaningful difference. The fork builds the
pull request's branch (refs/pull/<number>/head), whereas the pull
request builds a reference that github creates by merging the fork's
branch into the pull request's base branch (refs/pull/<number>/merge).
If we merged based on the fork's CI status, we might have issues on
the main branch after the merge.
This is what the spark pull request does: it merges [1] the pull
request's branch with the pull request's base branch.

[1]: 
https://github.com/apache/spark/pull/29504/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2R99
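
To make the distinction between the two references concrete, here is a small sketch that spells out the ref names for a given pull request number and the matching fetch command. The helper names and the "upstream" remote name are assumptions for illustration; only the refs/pull/.../head and refs/pull/.../merge naming comes from the message above.

```python
from typing import Dict, List


def pull_request_refs(number: int) -> Dict[str, str]:
    """Return the two github refs that exist for a pull request.

    "head"  is the contributor's branch as pushed.
    "merge" is the result of merging that branch into the base branch,
            which is what the pull request's own CI builds.
    """
    if number <= 0:
        raise ValueError("pull request number must be positive")
    return {
        "head": f"refs/pull/{number}/head",
        "merge": f"refs/pull/{number}/merge",
    }


def fetch_command(number: int, kind: str = "merge", remote: str = "upstream") -> List[str]:
    """Build (but do not run) the git command that fetches one of the refs.

    The remote name "upstream" is a hypothetical default.
    """
    ref = pull_request_refs(number)[kind]
    return ["git", "fetch", remote, ref]
```

A fork-side CI job that wants to approximate the pull request build would fetch the "merge" ref (or perform the equivalent merge itself) rather than building the "head" ref alone.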
>
> Regards
>
> Antoine.


Re: [Rust] [DataFusion] Proposal for datafusion test reorganization

2021-04-15 Thread Andrew Lamb
Thanks Daniël,

I'll write up a more formal proposal / jira in the upcoming days

Andrew

On Tue, Apr 13, 2021 at 11:37 AM Daniël Heres  wrote:

> Late reply, but I agree these test modules need a bit of reorganization. I
> also found myself adding tests to context.rs / sql.rs just because
> related/similar tests are included there.
>
> Sounds like a good reorganization to me!
>
> On Fri, Apr 9, 2021, 20:44 Andrew Lamb  wrote:
>
> > As Jorge points out here [1], the tests in datafusion/src/context.rs are
> > not really unit tests. They are more like SQL integration tests. There is
> > also a small and languishing set of sql tests in `rust/datafusion/tests/
> > sql.rs`.
> >
> > These tests are critical for DataFusion's quality and I would like to
> > propose a small reorganization so it is easier to find existing test
> > coverage and write new ones
> >
> > Specifically I propose:
> > 1. move `rust/datafusion/src/test` to its own module `rust/test_helpers`
> > (so that it can be shared with sql.rs)
> > 2. Update the style of all sql.rs tests to be in line with that in
> > context.rs
> > (using assert_batches_eq!)
> > 3. Move tests that are not specific to `ExecutionContext` out of
> > context.rs
> > and into sql.rs
> >
> > Then over time I imagine being able to organize the tests within sql.rs
> > better (split into multiple modules, for example)
> >
> > If no one objects, I'll write up some JIRA tickets and start trying to
> move
> > in this direction
> >
> > Thanks,
> > Andrew
> >
> > [1]
> https://github.com/apache/arrow/pull/9936#pullrequestreview-632020250
> >
>


Re: [VOTE] Move Rust components to new repos and process

2021-04-15 Thread Daniël Heres
+1

Op do 15 apr. 2021 om 12:37 schreef Andrew Lamb :

> +1
>
> On Thu, Apr 15, 2021 at 1:17 AM Fernando Herrera <
> fernando.j.herr...@gmail.com> wrote:
>
> > +1
> >
> > On Thu, 15 Apr 2021, 05:57 Sutou Kouhei,  wrote:
> >
> > > +1
> > >
> > > In  >
> > >   "[VOTE] Move Rust components to new repos and process" on Wed, 14 Apr
> > > 2021 18:04:44 -0600,
> > >   Andy Grove  wrote:
> > >
> > > > This vote is to determine if the Arrow PMC is in favor of the Rust
> > > > community moving the Rust implementation of Apache Arrow as well as
> the
> > > > related projects (such as Parquet, DataFusion, Ballista, etc) out of
> > the
> > > > monorepo and into two new repositories, as outlined in the proposal
> > > > document [1].
> > > >
> > > > Please vote whether to accept the proposal and allow the Rust
> community
> > > to
> > > > proceed with the work.
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 : Accept the proposal
> > > >
> > > > [ ] 0 : No opinion
> > > >
> > > > [ ] -1 : Reject proposal because...
> > > >
> > > > Here is my vote: +1
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > > [1]
> > > >
> > >
> >
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit?usp=sharing
> > >
> >
>


-- 
Daniël Heres


Re: [VOTE] Move Rust components to new repos and process

2021-04-15 Thread Andrew Lamb
+1

On Thu, Apr 15, 2021 at 1:17 AM Fernando Herrera <
fernando.j.herr...@gmail.com> wrote:

> +1
>
> On Thu, 15 Apr 2021, 05:57 Sutou Kouhei,  wrote:
>
> > +1
> >
> > In 
> >   "[VOTE] Move Rust components to new repos and process" on Wed, 14 Apr
> > 2021 18:04:44 -0600,
> >   Andy Grove  wrote:
> >
> > > This vote is to determine if the Arrow PMC is in favor of the Rust
> > > community moving the Rust implementation of Apache Arrow as well as the
> > > related projects (such as Parquet, DataFusion, Ballista, etc) out of
> the
> > > monorepo and into two new repositories, as outlined in the proposal
> > > document [1].
> > >
> > > Please vote whether to accept the proposal and allow the Rust community
> > to
> > > proceed with the work.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 : Accept the proposal
> > >
> > > [ ] 0 : No opinion
> > >
> > > [ ] -1 : Reject proposal because...
> > >
> > > Here is my vote: +1
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > [1]
> > >
> >
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit?usp=sharing
> >
>


[NIGHTLY] Arrow Build Report for Job nightly-2021-04-15-0

2021-04-15 Thread Crossbow


Arrow Build Report for Job nightly-2021-04-15-0

All tasks: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0

Failed Tasks:
- conda-linux-gcc-py36-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-drone-conda-linux-gcc-py36-arm64
- conda-linux-gcc-py37-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-drone-conda-linux-gcc-py37-arm64
- conda-linux-gcc-py38-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-drone-conda-linux-gcc-py38-arm64
- conda-linux-gcc-py39-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-drone-conda-linux-gcc-py39-arm64
- conda-osx-clang-py38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-osx-clang-py38
- gandiva-jar-ubuntu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-github-gandiva-jar-ubuntu
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-github-test-conda-python-3.8-jpype
- test-conda-python-3.8-spark-master:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-github-test-conda-python-3.8-spark-master
- test-conda-python-3.9:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-github-test-conda-python-3.9
- test-r-linux-as-cran:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-github-test-r-linux-as-cran
- test-r-minimal-build:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-test-r-minimal-build
- test-ubuntu-20.10-docs:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-test-ubuntu-20.10-docs

Pending Tasks:
- debian-buster-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-travis-debian-buster-arm64
- wheel-manylinux2014-cp36-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-travis-wheel-manylinux2014-cp36-arm64
- wheel-manylinux2014-cp38-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-travis-wheel-manylinux2014-cp38-arm64
- wheel-manylinux2014-cp39-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-travis-wheel-manylinux2014-cp39-arm64

Succeeded Tasks:
- centos-7-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-github-centos-8-amd64
- centos-8-arm64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-travis-centos-8-arm64
- conda-clean:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-clean
- conda-linux-gcc-py36-cpu-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gcc-py39-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-linux-gcc-py39-cuda
- conda-osx-clang-py36-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-osx-clang-py36-r36
- conda-osx-clang-py37-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-04-15-0-azure-conda-osx-clang-py37-r40
- 

Re: CI feedback time

2021-04-15 Thread Antoine Pitrou



Le 15/04/2021 à 03:13, Kazuaki Ishizaki a écrit :

As we know this is a common issue among Apache projects. While the
projects do not have the final solution, the Apache Spark project has a
mechanism [1][2] to run tests in one's own local (forked) repository. Can we
alleviate the problem a little bit?


Anyone can already enable AppVeyor, Travis-CI and Github Actions on 
their own fork.  There is no particular action to do here.


Regards

Antoine.


Re: Status of Arrow Julia implementation?

2021-04-15 Thread Sutou Kouhei
Hi Jacob,

OK. Here is my plan:

  1. We wait for the Rust move to complete
  2. We use a process similar to the Rust move


Thanks,
--
kou

In 
  "Re: Status of Arrow Julia implementation?" on Wed, 14 Apr 2021 08:37:41 
-0600,
  Jacob Quinn  wrote:

> Thank you kou! I appreciate the help. I'm happy to do whatever is required
> to facilitate the moving/donating process from JuliaData/Arrow.jl to
> apache/arrow-julia.
> 
> -Jacob
> 
> On Mon, Apr 12, 2021 at 7:53 PM Sutou Kouhei  wrote:
> 
>> Hi Jacob,
>>
>> I, a PMC member, talked to Kenta Murata, a committer and a
>> Julia user, about this.
>>
>> We support you and the Julia folks working on
>> apache/arrow-julia until we have enough PMC members from
>> the Julia folks. For example, we'll help with the IP clearance process to
>> import the latest JuliaData/Arrow.jl changes to apache/ and
>> we'll start voting on Julia package releases.
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In 
>>   "Re: Status of Arrow Julia implementation?" on Sun, 11 Apr 2021 23:06:27
>> -0600,
>>   Jacob Quinn  wrote:
>>
>> > Micah/Wes,
>> >
>> > Yes, I've been following the rust proposal thread with great interest. I
>> do
>> > think that provides a great path forward: transferring the
>> > JuliaData/Arrow.jl repo to apache/arrow-julia would help to solve the
>> > "package history" technical challenges that in part led to the current
>> > setup and concerns. I think being able to utilize github issues would
>> also
>> > be great; as I've mentioned elsewhere, it's much more
>> traditional/expected
>> > in the Julia ecosystem.
>> >
>> > I think the package could retain an independent versioning scheme. The
>> >> additional process would be voting on release candidates. If the Julia
>> >> folks want to try again and move development to a new, Julia-specific
>> >> apache/* repository and apply the ASF governance to the project, the
>> >> Arrow PMC could probably fast-track making Jacob a committer. In some
>> >> code donations / IP clearance, the contributors for the donated code
>> >> become committers as part of the transaction.
>> >>
>> >
>> > These all sound great and would greatly facilitate a better integration
>> > under ASF governance. These points definitely resolve my main concerns.
>> >
>> > As I commented on the rust thread, I'm mostly interested in the future of
>> > integration testing for rust/julia if they are split out into separate
>> > repos. In the current Julia implementation, we have all the code to read
>> > arrow json, and I just hand-generated the integration test data and
>> > committed them in the repo itself, but it doesn't interface with other
>> > languages (just reads arrow json, produces arrow file, reads arrow file,
>> > compares w/ original arrow json). I'm happy to help work on the details
>> of
>> > what that looks like and pilot some solutions. I think with a solid
>> > inter-repo integration testing framework, we can keep a strong sync
>> between
>> > projects.
>> >
>> > -Jacob
>> >
>> >
>> > On Sun, Apr 11, 2021 at 5:08 PM Wes McKinney 
>> wrote:
>> >
>> >> On Sat, Apr 10, 2021 at 4:07 PM Micah Kornfield 
>> >> wrote:
>> >> >
>> >> > >
>> >> > > Ok, I've had a chance to discuss with a few other Julia developers
>> and
>> >> > > review various options. I think it's best to drop the Julia code
>> from
>> >> the
>> >> > > physical apache/arrow repo. The extra overhead on development,
>> release
>> >> > > process, and user issue reporting and PR contributing are too much
>> in
>> >> > > addition to the technical challenges that we never resolved
>> involving
>> >> > > including the past Arrow.jl release version git trees in the
>> >> apache/arrow
>> >> > > repo.
>> >> >
>> >> >
>> >> > Hi Jacob,
>> >> > It seems you are on the new thread discussing a proposal for changing
>> >> > Rust's development model.   Would the proposal [1] address most of
>> these
>> >> > concerns if Julia was set up in the same way?
>> >> >
>> >> >  It seems in the short term the stickiest point would be committer
>> access
>> >> > to the new repos, and I suppose the release mechanics still might be
>> >> > challenging?
>> >>
>> >> I think the package could retain an independent versioning scheme. The
>> >> additional process would be voting on release candidates. If the Julia
>> >> folks want to try again and move development to a new, Julia-specific
>> >> apache/* repository and apply the ASF governance to the project, the
>> >> Arrow PMC could probably fast-track making Jacob a committer. In some
>> >> code donations / IP clearance, the contributors for the donated code
>> >> become committers as part of the transaction.
>> >>
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> > [1]
>> >> >
>> >>
>> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit
>> >> >
>> >> > On Wed, Apr 7, 2021 at 4:17 AM Wes McKinney 
>> wrote:
>> >> >
>> >> > > I went back and read the mailing list discussions from September
>> about
>> >> > > the donation and I would say there was