[Governance] [Proposal] Stop force-pushing to PRs after release?

2020-11-24 Thread Jorge Cardoso Leitão
Hi,

Based on a discussion on PR #8481, I would like to raise a concern around
git and the post-release actions. The background is that I was really
confused when someone force-pushed to a PR that I had filed, rewriting
its history and causing the PR to break.

@wes and @kszucs quickly explained to me that this is normal practice in
this project on every release, which left me a bit astonished.

AFAIK, in open source there is a strong expectation that PRs are managed
by individual contributors, and that committers of the project only request
changes from contributors, or kindly ask before pushing (never
force-pushing) directly to the PR.

IMO, by force-pushing to PRs, we are inverting all expectations and
sometimes even breaking PRs without consent from the contributor. This
drives any reasonable contributor to be pissed off at the team for what we
just did after a release:

   - force-pushed to master
   - force-pushed to their PRs
   - broke their PRs' CI
   - gave no prior notice and made no request for any of the above

IMO this is confusing and goes against what virtually every open source
project does. This process also puts a lot of strain on our CI when we
have an average of 100 open PRs and force-push to all of them at once.

As such, I would like to propose a small change to the post-release process
and to our development practices more generally:

   1. stop force-pushing to others' PRs
   2. stop pushing to others' PRs without their explicit consent
   3. document in the contributing guidelines that master is force-pushed
   on every release, along with the steps that contributors need to take to
   bring their PRs up to date with the latest master

The underlying principles here are:

   - it is the contributor's responsibility to keep their PRs in a "ready to
   merge" state, rebasing them onto master as master changes
   - a force-push to master is a change to master
   - thus it is the contributor's responsibility to rebase their PRs
   onto the new master

I understand the argument that it is a burden for contributors to keep
PRs up to date. However, I do not think that this justifies breaking one of
the most basic assumptions that a contributor has about an open source
project. Furthermore, contributors already have to do this whenever master
changes in a way that breaks their PR: the contributor's process is already
"git fetch upstream && git rebase upstream/master" whenever master changes.
Whether master changes through a normal push or a force-push makes little
difference to this burden compared to resolving an ordinary merge conflict.

Any thoughts?
Best,
Jorge


Re: [DISCUSS] Memory alignment in rust - what to do?

2020-11-24 Thread Jorge Cardoso Leitão
To bring closure to this thread:

#8401 implements the necessary functionality to import from and export to
the C data interface, includes an integration test that runs these against
the API provided by pyarrow, and the CI is green.
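
As a rough sketch of the pyarrow side of such a roundtrip (this assumes the
private `_export_to_c` / `_import_from_c` methods and the `pyarrow.cffi`
module mentioned in the quoted message below; the methods are undocumented,
so treat it as illustrative rather than a stable API):

    import pyarrow as pa
    from pyarrow.cffi import ffi  # cffi declarations of the C data interface structs

    # Allocate the two C structs that the C data interface passes around.
    c_schema = ffi.new("struct ArrowSchema*")
    c_array = ffi.new("struct ArrowArray*")
    schema_ptr = int(ffi.cast("uintptr_t", c_schema))
    array_ptr = int(ffi.cast("uintptr_t", c_array))

    # Export a pyarrow array through the C data interface...
    arr = pa.array([1, 2, 3], type=pa.int64())
    arr._export_to_c(array_ptr, schema_ptr)

    # ...and import it back. Another implementation (e.g. the Rust code in
    # #8401) would consume the same two structs instead of pyarrow.
    roundtripped = pa.Array._import_from_c(array_ptr, schema_ptr)
    assert roundtripped.equals(arr)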

Special thanks to Antoine Pitrou, who was instrumental in guiding me
through this.

Best,
Jorge


On Thu, Sep 24, 2020 at 7:25 PM Antoine Pitrou  wrote:

>
> On 24/09/2020 at 17:18, Antoine Pitrou wrote:
> >> 1. in pyarrow, I was only able to find Array.from_buffers and
> from_pandas.
> >> Is the ABI implemented but not documented?
> >
> > It is implemented in C++ and also exposed (but undocumented) in Python.
> >
> > The Python methods are called `Array._import_from_c`,
> > `Array._export_to_c`, likewise for `Schema` and `RecordBatch`.
> >
> > You can find the source for Array methods here:
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L1201
> >
> > There are ad-hoc tests for Python here:
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_cffi.py
> >
> > The core C++ implementation is exported and documented here:
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.h
>
> Ah, and there are a couple of low-level C helpers you may find useful
> here, if you want to reimplement the same operations in Rust:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/helpers.h
>
> Regards
>
> Antoine.
>


Re: Sort int tuples across Arrow arrays in C++

2020-11-24 Thread Sutou Kouhei
Hi,

Multi-column sort_indices on record batch has been implemented:
  https://github.com/apache/arrow/pull/8612

You'll be able to use it with Apache Arrow 3.0.0.
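
A rough Python sketch of how it can be used (this assumes the kernel is
exposed as pyarrow.compute.sort_indices accepting a record batch and
sort_keys; the column values are made up for illustration):

    import pyarrow as pa
    import pyarrow.compute as pc

    # Tuples (a, b) stored column-wise, as in the quoted question.
    batch = pa.record_batch(
        [pa.array([2, 1, 2, 1]), pa.array([15, 10, 10, 15])],
        names=["a", "b"],
    )

    # Multi-column sort: order by "a" first, then by "b".
    indices = pc.sort_indices(
        batch, sort_keys=[("a", "ascending"), ("b", "ascending")]
    )
    sorted_batch = batch.take(indices)
    print(sorted_batch.to_pydict())
    # {'a': [1, 1, 2, 2], 'b': [10, 15, 10, 15]}

The underlying kernel comes from the C++ PR above, so the same capability
is available from C++ as well.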


Thanks,
--
kou

In 
  "Sort int tuples across Arrow arrays in C++" on Thu, 3 Sep 2020 14:26:09 
+0200,
  Rares Vernica  wrote:

> Hello,
> 
> I have a set of integer tuples that need to be collected and sorted at a
> coordinator. Here is an example with tuples of length 2:
> 
> [(1, 10),
>  (1, 15),
>  (2, 10),
>  (2, 15)]
> 
> I am considering storing each column in an Arrow array, e.g., [1, 1, 2, 2]
> and [10, 15, 10, 15], and have the Arrow arrays grouped in a Record Batch.
> Then I would serialize, transfer, and deserialize each record batch. The
> coordinator would collect all the record batches and concatenate them.
> Finally, the coordinator needs to sort the tuples by value in the
> sequential order of the columns, e.g., (1, 10), (1, 15), (2, 10).
> 
> Could I accomplish the sort using the Arrow API? I looked at sort_indices
> but it does not work on record batches. With a set of sort indices for each
> array, sorting the tuples does not seem to be straightforward, right?
> 
> Thanks!
> Rares


Re: [Discuss] Should dense union offsets be always increasing?

2020-11-24 Thread Antoine Pitrou


Hello all,

Does anybody else want to give an opinion on this?

Thank you

Antoine.


On Tue, 17 Nov 2020 12:28:06 +0100
Antoine Pitrou  wrote:
> Hello,
> 
> The format spec and the C++ implementation disagree on one point:
> 
> * The spec says that dense union offsets should be increasing:
> """The respective offsets for each child value array must be in order /
> increasing."""
> 
> (from https://arrow.apache.org/docs/format/Columnar.html#dense-union)
> 
> * The C++ implementation has long had some tests that used deliberately
> non-increasing (even descending) dense union offsets.
> 
> (see https://issues.apache.org/jira/browse/ARROW-10580)
> 
> I don't know what other implementations, especially Java, expect.
> 
> There are obviously two possible solutions:
> 
> 1) Fix the C++ implementation and its tests to conform to the format
> spec (which may break compatibility for code producing / consuming dense
> unions with non-increasing offsets)
> 
> 2) Relax the format spec to allow arbitrary offsets (which could make
> dense union more like a polymorphic dictionary).
> 
> If the first solution is chosen, then another question arises: must the
> offsets be strictly increasing?  Or can a given offset appear several
> times in a row?
> (the latter is currently exploited by the C++ implementation: when
> appending several nulls to a DenseUnionBuilder, only one child null slot
> is added and the same offset is appended multiple times)
> 
> Regards
> 
> Antoine.
> 
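
For reference, here is a rough pyarrow sketch of what "in order /
increasing" means for the per-child offsets of a dense union (the values
are made up for illustration, and the from_dense factory does not
necessarily validate the offsets):

    import pyarrow as pa

    # Two children of the union.
    ints = pa.array([1, 2, 3], type=pa.int64())
    strs = pa.array(["a", "b"], type=pa.string())

    # One type id and one offset per slot; each offset indexes into the
    # child selected by the corresponding type id.
    type_ids = pa.array([0, 1, 0, 0, 1], type=pa.int8())
    offsets = pa.array([0, 0, 1, 2, 1], type=pa.int32())

    # Per the format spec, the offsets pointing into a given child must be
    # increasing: here child 0 sees 0, 1, 2 and child 1 sees 0, 1.
    union = pa.UnionArray.from_dense(type_ids, offsets, [ints, strs])
    print(union)  # the slots decode to 1, "a", 2, 3, "b"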





Re: [CI] Github Actions statistics

2020-11-24 Thread Jorge Cardoso Leitão
Hi,

Thanks a lot for sharing these.

I am looking through the tests that we run, and how we run them, as I would
really like to take a stab at improving this. However, I can't commit to it
without some agreement.

I took a hard look at archery and most of our builds, and these are my
observations:

* we heavily rely on docker image caches to store artifacts, by using the
cache action on `.docker`, as well as by pushing images to a registry and
fetching them
* we use a docker-compose file to enumerate all our builds (currently over
1k LOC)
* we use a custom-made Python package (archery) for a heterogeneous set of
tasks (release, merging PRs, running docker-compose, running docker)

Let's evaluate the execution path of one of our major runs, the integration
tests that run on every push:

1. the build is triggered by the workflow `integration.yaml` on every push
and on every change to any implementation
2. this installs Python, archery and docker-compose, and runs `archery docker
run conda-integration`
3. this calls the equivalent of `docker-compose run conda-integration`
4. this:
4.1 builds a docker image from `conda-integration.dockerfile` that contains
Python, conda, archery, Go, Maven, Rust and Node, all of them installed via
conda
4.2 uses this image to build every implementation (using `docker run
CMD_TO_BUILD_ALL`)
5. runs all integration tests

Steps 1-4 take 20-35 minutes and step 5 takes 5 minutes, irrespective of
what code changed. IMO there is potential for a major improvement here.

Some opinionated observations that demotivated me from progressing:

1. the current setup tightly couples the builds of all implementations,
making it difficult to refactor and simplify; i.e. we have one docker image
to build all implementations, and we build them all in a single command
2. we use conda to install dependencies such as Maven, Node, the JDK and Go
3. we use Python/archery for almost everything, even when a simpler
`docker-compose run X` would suffice

With this said, there are two changes to the current design that I would
like to work on, if there is buy-in for the general ideas:

1. make every artifact an independent build.

The integration tests can be broadly described by a DAG with the following
edge list:

 cpp artifacts <- test result
 js artifacts <- test result
 go artifacts <- test result
 rust artifacts <- test result
 ...

My suggestion is that, instead of running one job that builds all of these
artifacts at once plus the test execution, we use N+1 jobs that build each
artifact independently, and a "test" job that picks these up (cached via the
cache action) and runs the actual test. This segmentation would allow us to
cache the artifacts when the code does not change, which would significantly
improve the performance issue mentioned above.

2. make every build environment dedicated to what is being built

I.e. instead of preparing one docker image that builds all of these
artifacts at once, we prepare N docker images, each building a single
artifact: one image to build Rust, one to build Go, one to build C++, etc.
This eliminates the tight coupling that currently exists between building
these implementations.

Note that I see this as a stopgap. IMO the integration tests should consume
the artifacts built by each individual implementation, shared between jobs
(and not even run if an artifact cannot be produced, e.g. due to a
compilation error), instead of building everything twice.

Any thoughts?

Best,
Jorge






On Mon, Nov 23, 2020 at 9:58 PM Krisztián Szűcs wrote:

> On Mon, Nov 23, 2020 at 3:38 PM Antoine Pitrou  wrote:
> >
> >
> > Hello,
> >
> > (sorry, disregard the previous e-mail, I pressed the Send button too
> early)
> >
> > The folks at the apache-builds mailing-list gathered some statistics
> > about GHA usage of various Apache projects:
> >
> >
> https://mail-archives.apache.org/mod_mbox/www-builds/202011.mbox/%3CCADe6CU_a5_HhGNFNGGYwfCdJR0-yPxOuAwnKxaPRvnOOPp86sA%40mail.gmail.com%3E
> >
> >
> https://docs.google.com/spreadsheets/d/1SE9HIHBPmTZuW1WAgdVbEcGouGesiyrnXDIZxx25RSE/edit#gid=0
> >
> > It seems Arrow is the third biggest consumer of Apache GHA CI resources,
> > if measured by median number of in-progress workflow runs.
> > (I'm not sure whether this measures individual jobs, or if several jobs
> > are counted as a single workflow, given that GHA has a rather bizarre
> model)
> Thanks for the heads up!
>
> We have a high queued max value because of the post-release mass PR
> rebase script, which distorts the average values as well.
> Based on the medians, I don't think we drastically overuse our share of
> the GHA capacity.
>
> On the other hand we can remove a couple of low priority builds (or
> schedule them as nightlies).
>
> Regards, Krisztian
> >
> > Regards
> >
> > Antoine.
> >
>


Re: ursa-labs/crossbow on travis-ci.com is disabled

2020-11-24 Thread Antoine Pitrou


On 24/11/2020 at 13:42, Krisztián Szűcs wrote:
> Another alternative would be to setup crossbow for other accounts as well.

You'd hit the limit rather quickly anyway, no?

I think we'll either have to buy CPU time from Travis-CI, or move away
entirely from it.

Regards

Antoine.


Re: ursa-labs/crossbow on travis-ci.com is disabled

2020-11-24 Thread Krisztián Szűcs
Another alternative would be to setup crossbow for other accounts as well.

On Tue, Nov 24, 2020 at 1:36 PM Krisztián Szűcs wrote:
>
> On Tue, Nov 24, 2020 at 5:29 AM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > It seems that ursa-labs/crossbow on travis-ci.com has been
> > disabled since 2020-11-22:
> >
> >   https://travis-ci.com/github/ursa-labs/crossbow/builds
> >
> > > Builds have been temporarily disabled for public
> > > repositories due to a negative credit balance. Please go
> > > to the Plan page to replenish your credit balance or alter
> > > your Consume paid credits for OSS setting.
> >
> >
> > Could someone in Ursa Labs confirm this?
> Confirmed, we already have a negative credit balance due to Travis'
> new billing strategy.
> The macOS wheels quickly consume the credit-based free tier, so Travis
> disables even the Linux builds.
>
> I think we should migrate away from Travis to GHA or Azure; drawbacks:
> - the wheel scripts are tailored for Travis
> - only amd64 arch
> >
> >
> > Thanks,
> > --
> > kou


Re: ursa-labs/crossbow on travis-ci.com is disabled

2020-11-24 Thread Krisztián Szűcs
On Tue, Nov 24, 2020 at 5:29 AM Sutou Kouhei  wrote:
>
> Hi,
>
> It seems that ursa-labs/crossbow on travis-ci.com has been
> disabled since 2020-11-22:
>
>   https://travis-ci.com/github/ursa-labs/crossbow/builds
>
> > Builds have been temporarily disabled for public
> > repositories due to a negative credit balance. Please go
> > to the Plan page to replenish your credit balance or alter
> > your Consume paid credits for OSS setting.
>
>
> Could someone in Ursa Labs confirm this?
Confirmed, we already have a negative credit balance due to Travis'
new billing strategy.
The macOS wheels quickly consume the credit-based free tier, so Travis
disables even the Linux builds.

I think we should migrate away from Travis to GHA or Azure; drawbacks:
- the wheel scripts are tailored for Travis
- only amd64 arch
>
>
> Thanks,
> --
> kou


[NIGHTLY] Arrow Build Report for Job nightly-2020-11-24-0

2020-11-24 Thread Crossbow


Arrow Build Report for Job nightly-2020-11-24-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0

Failed Tasks:
- nuget:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-nuget
- test-conda-python-3.7-spark-branch-3.0:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-test-conda-python-3.7-spark-branch-3.0
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-test-conda-python-3.8-jpype
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-test-ubuntu-18.04-docs

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-clean
- conda-linux-gcc-py36-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py36-cpu-r36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py37-cpu-r40:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-drone-conda-linux-gcc-py38-aarch64
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py36-r36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-osx-clang-py36-r36
- conda-osx-clang-py37-r40:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-osx-clang-py37-r40
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-osx-clang-py38
- conda-win-vs2017-py36-r36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-win-vs2017-py36-r36
- conda-win-vs2017-py37-r40:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-win-vs2017-py37-r40
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-travis-debian-stretch-arm64
- example-cpp-minimal-build-static-system-dependency:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-example-cpp-minimal-build-static-system-dependency
- example-cpp-minimal-build-static:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-github-example-cpp-minimal-build-static
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-24-0-travis-hom