Re: [DISCUSS] Release cadence and release vote conventions

2019-07-31 Thread Sutou Kouhei
Hi,

Sorry for not replying to this thread sooner.

I think that the biggest problem is related to our Java
package.


We'll be able to resolve the GPG key problem by creating a
GPG key used only for the nightly release test. We can share
the test GPG key publicly because it's just for testing.

It'll work for our binary artifacts and APT/Yum repositories,
but not for our Java package. I don't know where the GPG
key is used in our Java package...
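Generating such a throwaway key can be done unattended with GPG's batch mode. This is only a sketch: the key parameters and file path are illustrative, and a key like this must never be used to sign real release artifacts.

```shell
# Generate a throwaway GPG key for the nightly release test only.
# Because the key (and its lack of passphrase) would be public, it
# must never sign a real release.
cat > /tmp/nightly-test-key.conf <<'EOF'
%no-protection
Key-Type: RSA
Key-Length: 4096
Name-Real: Apache Arrow Nightly Test
Name-Email: dev@arrow.apache.org
Expire-Date: 30d
%commit
EOF
# Run where GnuPG 2.x is available:
# gpg --batch --generate-key /tmp/nightly-test-key.conf
```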


We'll be able to resolve the Git commit problem by creating
a cloned Git repository for testing. This is already done in
our dev/release/00-prepare-test.rb [1].

[1] 
https://github.com/apache/arrow/blob/master/dev/release/00-prepare-test.rb#L30 
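The cloned-repository idea can be sketched in plain shell (the real logic lives in 00-prepare-test.rb; the paths and seed commit below are illustrative): stand up a scratch "origin" and let the release scripts commit, tag, and push against it, so a test run never touches apache/arrow.

```shell
# Create a scratch "origin" plus a working clone; release scripts can
# then commit, tag, and push here without touching apache/arrow.
rm -rf /tmp/arrow-test-origin.git /tmp/arrow-test
git init --bare /tmp/arrow-test-origin.git
git clone /tmp/arrow-test-origin.git /tmp/arrow-test
cd /tmp/arrow-test
git config user.name "Nightly Test"
git config user.email "dev@arrow.apache.org"
git commit --allow-empty -m "seed commit for release dry run"
git tag apache-arrow-0.15.0-test   # tags are safe: they stay in /tmp
git push origin HEAD --tags
```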

The biggest problem for the Git commit is that our Java
package requires an "apache-arrow-${VERSION}" tag on
https://github.com/apache/arrow . (Right?)
I think that "mvn release:perform" in
dev/release/01-perform.sh does so, but I don't know the
details of "mvn release:perform"...


More details:

dev/release/00-prepare.sh:

We'll be able to run this automatically when we can resolve
the above GPG key problem in our Java package. We can
resolve the Git commit problem by creating a cloned Git
repository.

dev/release/01-perform.sh:

We'll be able to run this automatically when we can resolve
the above Git commit ("apache-arrow-${VERSION}" tag) problem
in our Java package.

dev/release/02-source.sh:

We'll be able to run this automatically by creating a GPG
key for the nightly release test. We'll use Bintray to upload
the RC source archive instead of dist.apache.org. Note that
we need a Bintray API key for this; it must be kept secret.
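As a hedged sketch, the upload step could look like the following against Bintray's content-upload REST endpoint. The subject/repo/package path is illustrative, and the command is only echoed here because a real upload needs the secret API key.

```shell
# Build the Bintray content-upload URL for an RC source archive.
# BINTRAY_USER / BINTRAY_APIKEY would come from secret CI variables.
VERSION=0.15.0
RC=0
ARCHIVE="apache-arrow-${VERSION}.tar.gz"
URL="https://api.bintray.com/content/apache/arrow/source-rc/${VERSION}-rc${RC}/${ARCHIVE}"
echo "would run: curl -T ${ARCHIVE} -u \$BINTRAY_USER:\$BINTRAY_APIKEY ${URL}"
```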

dev/release/03-binary.sh:

We'll be able to run this automatically by creating a GPG
key for the nightly release test. We need a Bintray API key
here too.

We need to improve this script to support the nightly
release test. It currently uses "XXX-rc" names such as
"debian-rc" for the Bintray "package" name; it should use
"XXX-nightly" names such as "debian-nightly" for the nightly
release test instead.
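A parameterized suffix would let the same script serve both cases; a minimal sketch, with illustrative target names:

```shell
# Pick the Bintray "package" suffix from the kind of run:
# "rc" for a release candidate, "nightly" for the nightly test.
RELEASE_KIND=nightly
packages=""
for target in debian ubuntu centos; do
  packages="${packages} ${target}-${RELEASE_KIND}"
done
echo "Bintray packages:${packages}"
```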

dev/release/post-00-release.sh:

We'll be able to skip this.

dev/release/post-01-upload.sh:

We'll be able to skip this.

dev/release/post-02-binary.sh:

We'll be able to run this automatically by creating Bintray
"packages" for the nightly release and using them. We can
create "XXX-nightly-release" ("debian-nightly-release")
Bintray "packages" and use them instead of the "XXX"
("debian") Bintray "packages".

"debian" Bintray "package": https://bintray.com/apache/debian/

We need to improve this to support nightly release.

dev/release/post-03-website.sh:

We'll be able to run this automatically by creating a cloned
Git repository for test.

It would be better to have a Web site that shows the
generated pages. We could create
https://github.com/apache/arrow-site/tree/asf-site/nightly
and use it, but I don't like that approach because
arrow-site would accumulate a new commit day by day.
Can we prepare a separate Web site for this? (arrow-nightly.ursalabs.org?)

dev/release/post-04-rubygems.sh:

We may be able to use GitHub Package Registry [2] to upload
RubyGems. We could use the "pre-release" package feature of
https://rubygems.org/ but it's not suitable for nightly
builds; it's meant for RC or beta releases.

[2] https://github.blog/2019-05-10-introducing-github-package-registry/
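A hypothetical sketch of a nightly gem push to GitHub Package Registry, which speaks the RubyGems protocol. The owner and gem file name are illustrative, and the command is only echoed because a real push needs a token with write:packages scope.

```shell
# Push a nightly gem to GitHub Package Registry instead of
# rubygems.org (owner "apache" and the gem file name are assumptions).
GPR_HOST="https://rubygems.pkg.github.com/apache"
GEM_FILE="red-arrow-1.0.0.dev.$(date +%Y%m%d).gem"
echo "would run: gem push --key github --host ${GPR_HOST} ${GEM_FILE}"
```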

dev/release/post-05-js.sh:

We may be able to use GitHub Package Registry[2] to upload
npm packages.

dev/release/post-06-csharp.sh:

We may be able to use GitHub Package Registry[2] to upload
NuGet packages.

dev/release/post-07-rust.sh:

I don't have a good idea here, but it must be able to run
automatically. It has always failed for me, and I needed to
run each command manually.

dev/release/post-08-remove-rc.sh:

We'll be able to skip this.


Thanks,
--
kou

In 
  "Re: [DISCUSS] Release cadence and release vote conventions" on Wed, 31 Jul 
2019 15:35:57 -0500,
  Wes McKinney  wrote:

> The PMC member and their GPG keys need to be in the loop at some
> point. The release artifacts can be produced by some kind of CI/CD
> system so long as the PMC member has confidence in the security of
> those artifacts before signing them. For example, we build the
> official binary packages on public CI services and then download and
> sign them with Crossbow. I think the same could be done in theory with
> the source release but we'd first need to figure out what to do about
> the parts that create git commits.
> 
> On Wed, Jul 31, 2019 at 11:23 AM Andy Grove  wrote:
>>
>> To what extent would it be possible to automate the release process via
>> CICD?
>>
>> On Wed, Jul 31, 2019 at 9:19 AM Wes McKinney  wrote:
>>
>> > I think one thing that would help would be improving the
>> > reproducibility of the source release process. The RM has to have
>> > their machine configured in a particular way for it to work.
>> >
>> > Before anyone says "Docker" it isn't an easy solution because the
>> > release scripts need to be able to create git commits (created by the
>> > Maven release plugin) and sign artifacts using the RM's GPG keys.
>> >
>> > On Sat, Jul 27, 2019 at 10:04 PM Micah Kornfield 
>> > wrote:
>> > >
>> > > I just wanted to bump

Re: [DISCUSS] Release cadence and release vote conventions

2019-07-31 Thread Wes McKinney
The PMC member and their GPG keys need to be in the loop at some
point. The release artifacts can be produced by some kind of CI/CD
system so long as the PMC member has confidence in the security of
those artifacts before signing them. For example, we build the
official binary packages on public CI services and then download and
sign them with Crossbow. I think the same could be done in theory with
the source release but we'd first need to figure out what to do about
the parts that create git commits.
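The download-then-sign flow described above usually reduces to detached GPG signatures plus checksums over the fetched artifacts; a minimal sketch, with an illustrative artifact name (a placeholder file stands in for the real download):

```shell
# Sign a downloaded artifact with a detached ASCII-armored signature
# and record a SHA-512 checksum, as Apache release policy expects.
ARTIFACT=apache-arrow-0.15.0.tar.gz
echo "placeholder artifact for the sketch" > "${ARTIFACT}"
sha512sum "${ARTIFACT}" > "${ARTIFACT}.sha512"
# Run where the RM's GPG key is available:
# gpg --armor --detach-sign "${ARTIFACT}"
```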

On Wed, Jul 31, 2019 at 11:23 AM Andy Grove  wrote:
>
> To what extent would it be possible to automate the release process via
> CICD?
>
> On Wed, Jul 31, 2019 at 9:19 AM Wes McKinney  wrote:
>
> > I think one thing that would help would be improving the
> > reproducibility of the source release process. The RM has to have
> > their machine configured in a particular way for it to work.
> >
> > Before anyone says "Docker" it isn't an easy solution because the
> > release scripts need to be able to create git commits (created by the
> > Maven release plugin) and sign artifacts using the RM's GPG keys.
> >
> > On Sat, Jul 27, 2019 at 10:04 PM Micah Kornfield 
> > wrote:
> > >
> > > I just wanted to bump this thread.  Kou and Krisztián as the last two
> > > release managers is there any specific infrastructure that you think
> > might
> > > have helped?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Wed, Jul 17, 2019 at 11:29 PM Micah Kornfield 
> > > wrote:
> > >
> > > > I can help as well, but not exactly sure where to start.  It seems
> > like
> > > > there are already some JIRAs opened [1]
> > > > for improving the release?  Could someone more familiar with the
> > process
> > > > pick out the highest priority ones? Do more need to be opened?
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1]
> > > >
> > https://issues.apache.org/jira/browse/ARROW-2880?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(%22Developer%20Tools%22%2C%20Packaging)%20and%20summary%20~%20Release
> > > >
> > > > On Sat, Jul 13, 2019 at 7:17 AM Wes McKinney 
> > wrote:
> > > >
> > > >> To be effective at improving the life of release managers, the nightly
> > > >> release process really should use as close as possible to the same
> > > >> scripts that the RM uses to produce the release. Otherwise we could
> > > >> have a situation where the nightlies succeed but there is some problem
> > > >> that either fails an RC or is unable to be produced at all.
> > > >>
> > > >> On Sat, Jul 13, 2019 at 9:12 AM Andy Grove 
> > wrote:
> > > >> >
> > > >> > I would like to volunteer to help with Java and Rust release process
> > > >> work,
> > > >> > especially nightly releases.
> > > >> >
> > > >> > Although I'm not that familiar with the Java implementation of
> > Arrow, I
> > > >> > have been using Java and Maven for a very long time.
> > > >> >
> > > >> > Do we envisage a single nightly release process that releases all
> > > >> languages
> > > >> > simultaneously? or do we want separate process per language, with
> > > >> different
> > > >> > maintainers?
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Wed, Jul 10, 2019 at 8:18 AM Wes McKinney 
> > > >> wrote:
> > > >> >
> > > >> > > On Sun, Jul 7, 2019 at 7:40 PM Sutou Kouhei 
> > > >> wrote:
> > > >> > > >
> > > >> > > > Hi,
> > > >> > > >
> > > >> > > > > in future releases we should
> > > >> > > > > institute a minimum 24-hour "quiet period" after any community
> > > >> > > > > feedback on a release candidate to allow issues to be examined
> > > >> > > > > further.
> > > >> > > >
> > > > > > > > I agree with this. I'll do so when I serve as release manager in
> > > >> > > > the future.
> > > >> > > >
> > > >> > > > > To be able to release more often, two things have to happen:
> > > >> > > > >
> > > >> > > > > * More PMC members must engage with the release management
> > role,
> > > >> > > > > process, and tools
> > > >> > > > > * Continued improvements to release tooling to make the
> > process
> > > >> less
> > > >> > > > > painful for the release manager. For example, it seems we may
> > > >> want to
> > > >> > > > > find a different place than Bintray to host binary artifacts
> > > >> > > > > temporarily during release votes
> > > >> > > >
> > > > > > > > My opinion is that we need to build a nightly release system.
> > > >> > > >
> > > >> > > > It uses dev/release/NN-*.sh to build .tar.gz and binary
> > > >> > > > artifacts from the .tar.gz.
> > > >> > > > It also uses dev/release/verify-release-candidate.* to
> > > > > > > > verify the built .tar.gz and binary artifacts.
> > > >> > > > It also uses dev/release/post-NN-*.sh to do post release
> > > >> > > > tasks. (Some tasks such as uploading a package to packaging
> > > >> > > > system will be dry-run.)
> > > >> > > >
> > > >> > >
> > > >> > > I agree that having a turn-key release system that's capable of
> > > > > > producing nightly packages is the way to go. That 

Re: [VOTE] Adopt FORMAT and LIBRARY SemVer-based version schemes for Arrow 1.0.0 and beyond

2019-07-31 Thread Bryan Cutler
+1 (non-binding)

On Wed, Jul 31, 2019 at 8:59 AM Uwe L. Korn  wrote:

> +1 from me.
>
> I really like the separate versions
>
> Uwe
>
> On Tue, Jul 30, 2019, at 2:21 PM, Antoine Pitrou wrote:
> >
> > +1 from me.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On Fri, 26 Jul 2019 14:33:30 -0500
> > Wes McKinney  wrote:
> > > hello,
> > >
> > > As discussed on the mailing list thread [1], Micah Kornfield has
> > > proposed a version scheme for the project to take effect starting with
> > > the 1.0.0 release. See document [2] containing a discussion of the
> > > issues involved.
> > >
> > > To summarize my understanding of the plan:
> > >
> > > 1. TWO VERSIONS: As of 1.0.0, we establish separate FORMAT and LIBRARY
> > > versions. Currently there is only a single version number.
> > >
> > > 2. SEMANTIC VERSIONING: We follow https://semver.org/ with regards to
> > > communicating library API changes. Given the project's pace of
> > > evolution, most releases are likely to be MAJOR releases according to
> > > SemVer principles.
> > >
> > > 3. RELEASES: Releases of the project will be named according to the
> > > LIBRARY version. A major release may or may not change the FORMAT
> > > version. When a LIBRARY version has been released for a new FORMAT
> > > version, the latter is considered to be released and official.
> > >
> > > 4. Each LIBRARY version will have a corresponding FORMAT version. For
> > > example, LIBRARY versions 2.0.0 and 3.0.0 may track FORMAT version
> > > 1.0.0. The idea is that FORMAT version will change less often than
> > > LIBRARY version.
> > >
> > > 5. BACKWARD COMPATIBILITY GUARANTEE: A newer versioned client library
> > > will be able to read any data and metadata produced by an older client
> > > library.
> > >
> > > 6. FORWARD COMPATIBILITY GUARANTEE: An older client library must be
> > > able to either read data generated from a new client library or detect
> > > that it cannot properly read the data.
> > >
> > > 7. FORMAT MINOR VERSIONS: An increase in the minor version of the
> > > FORMAT version, such as 1.0.0 to 1.1.0, indicates that 1.1.0 contains
> > > new features not available in 1.0.0. So long as these features are not
> > > used (such as a new logical data type), forward compatibility is
> > > preserved.
> > >
> > > 8. FORMAT MAJOR VERSIONS: A change in the FORMAT major version
> > > indicates a disruption to these compatibility guarantees in some way.
> > > Hopefully we don't have to do this many times in our respective
> > > lifetimes
> > >
> > > If I've misrepresented some aspect of the proposal it's fine to
> > > discuss more and we can start a new votes.
> > >
> > > Please vote to approve this proposal. I'd like to keep this vote open
> > > for 7 days (until Friday August 2) to allow for ample opportunities
> > > for the community to have a look.
> > >
> > > [ ] +1 Adopt these version conventions and compatibility guarantees as
> > > of Apache Arrow 1.0.0
> > > [ ] +0
> > > [ ] -1 I disagree because...
> > >
> > > Here is my vote: +1
> > >
> > > Thanks
> > > Wes
> > >
> > > [1]:
> https://lists.apache.org/thread.html/5715a4d402c835d22d929a8069c5c0cf232077a660ee98639d544af8@%3Cdev.arrow.apache.org%3E
> > > [2]:
> https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#
> > >
> >
> >
> >
> >
>


Re: Ursabot configuration within Arrow

2019-07-31 Thread Krisztián Szűcs
We can now reproduce the builds locally (without the need for
the web UI) with a single command:

To demonstrate: building the master branch and building a pull
request require the following commands:

$ ursabot project build 'AMD64 Ubuntu 18.04 C++'

$ ursabot project build -pr  'AMD64 Ubuntu 18.04 C++'

See the output here:
https://travis-ci.org/ursa-labs/ursabot/builds/566057077#L988

This effectively means that the builders defined in ursabot
can be run directly (with a single command) on machines or
CI services that have Docker installed.
It also replaces the need for the docker-compose setup.

I'm going to write some documentation and prepare the arrow
builders for a donation to the arrow codebase (which of course
requires a vote).

If anyone has a question please don't hesitate to ask!

Regards, Krisztian


On Tue, Jul 30, 2019 at 4:45 PM Krisztián Szűcs 
wrote:

> Ok, but the configuration movement to arrow is orthogonal to
> the local reproducibility feature. Could we proceed with that?
>
> On Tue, Jul 30, 2019 at 4:38 PM Wes McKinney  wrote:
>
>> I will defer to others to investigate this matter further but I would
>> really like to see a concrete and practical path to local
>> reproducibility before moving forward on any changes to our current
>> CI.
>>
>> On Tue, Jul 30, 2019 at 7:38 AM Krisztián Szűcs
>>  wrote:
>> >
>> > Fixed it and restarted a bunch of builds.
>> >
>> > On Tue, Jul 30, 2019 at 5:13 AM Wes McKinney 
>> wrote:
>> >
>> > > By the way, can you please disable the Buildbot builders that are
>> > > causing builds on master to fail? We haven't had a passing build in
>> > > over a week. Until we reconcile the build configurations we shouldn't
>> > > be failing contributors' builds
>> > >
>> > > On Mon, Jul 29, 2019 at 8:23 PM Wes McKinney 
>> wrote:
>> > > >
>> > > > On Mon, Jul 29, 2019 at 7:58 PM Krisztián Szűcs
>> > > >  wrote:
>> > > > >
>> > > > > On Tue, Jul 30, 2019 at 1:38 AM Wes McKinney > >
>> > > wrote:
>> > > > >
>> > > > > > hi Krisztian,
>> > > > > >
>> > > > > > Before talking about any code donations or where to run builds,
>> I
>> > > > > > think we first need to discuss the worrisome situation where we
>> have
>> > > > > > in some cases 3 (or more) CI configurations for different
>> components
>> > > > > > in the project.
>> > > > > >
>> > > > > > Just taking into account out C++ build, we have:
>> > > > > >
>> > > > > > * A config for Travis CI
>> > > > > > * Multiple configurations in Dockerfiles under cpp/
>> > > > > > * A brand new (?) configuration in this third party
>> ursa-labs/ursabot
>> > > > > > repository
>> > > > > >
>> > > > > > I note for example that the "AMD64 Conda C++" Buildbot build is
>> > > > > > failing while Travis CI is succeeding
>> > > > > >
>> > > > > > https://ci.ursalabs.org/#builders/66/builds/3196
>> > > > > >
>> > > > > > Starting from first principles, at least for Linux-based
>> builds, what
>> > > > > > I would like to see is:
>> > > > > >
>> > > > > > * A single build configuration (which can be driven by
>> yaml-based
>> > > > > > configuration files and environment variables), rather than 3
>> like we
>> > > > > > have now. This build configuration should be decoupled from any
>> CI
>> > > > > > platform, including Travis CI and Buildbot
>> > > > > >
>> > > > > Yeah, this would be the ideal setup, but I'm afraid the situation
>> is a
>> > > bit
>> > > > > more complicated.
>> > > > >
>> > > > > TravisCI
>> > > > > 
>> > > > >
>> > > > > Our Travis CI setup is constructed from a bunch of scripts
>> > > > > optimized for Travis; this setup is slow and hardly
>> > > > > compatible with any of the remaining setups.
>> > > > > I think we should ditch it.
>> > > > >
>> > > > > The "docker-compose setup"
>> > > > > --
>> > > > >
>> > > > > Most of the Dockerfiles are part of the docker-compose setup
>> > > > > we've developed. This might be a good candidate as the tool to
>> > > > > centralize our future setup around, mostly because
>> > > > > docker-compose is widely used, and we could set up buildbot
>> > > > > builders (or any other CI's) to execute the sequence of
>> > > > > docker-compose build and docker-compose run commands.
>> > > > > However docker-compose is not suitable for building and running
>> > > > > hierarchical
>> > > > > images. This is why we have added a Makefile [1] to execute a
>> > > > > "build" with a single make command instead of manually
>> > > > > executing multiple commands involving multiple images (which is
>> > > > > error prone). Docker-compose can also leave a lot of garbage
>> > > > > behind: both containers and images.
>> > > > > Docker-compose shines when one needs to orchestrate multiple
>> > > > > containers and their networks / volumes on the same machine.
>> > > > > We made it work for Arrow though (with a couple of hacky
>> > > > > workarounds).
>> > > > > Despite that, I still consider the docker-compose setup a good
>> > > solution,

[jira] [Created] (ARROW-6091) Implement parallel execution for limit

2019-07-31 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6091:
-

 Summary: Implement parallel execution for limit
 Key: ARROW-6091
 URL: https://issues.apache.org/jira/browse/ARROW-6091
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6087) Implement parallel execution for CSV scan

2019-07-31 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6087:
-

 Summary: Implement parallel execution for CSV scan
 Key: ARROW-6087
 URL: https://issues.apache.org/jira/browse/ARROW-6087
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove








[jira] [Created] (ARROW-6090) Implement parallel execution for hash aggregate

2019-07-31 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6090:
-

 Summary: Implement parallel execution for hash aggregate
 Key: ARROW-6090
 URL: https://issues.apache.org/jira/browse/ARROW-6090
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove








[jira] [Created] (ARROW-6089) Implement parallel execution for selection

2019-07-31 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6089:
-

 Summary: Implement parallel execution for selection
 Key: ARROW-6089
 URL: https://issues.apache.org/jira/browse/ARROW-6089
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove








[jira] [Created] (ARROW-6088) Implement parallel execution for projection

2019-07-31 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6088:
-

 Summary: Implement parallel execution for projection
 Key: ARROW-6088
 URL: https://issues.apache.org/jira/browse/ARROW-6088
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove








[jira] [Created] (ARROW-6086) Implement parallel execution for parquet scan

2019-07-31 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6086:
-

 Summary: Implement parallel execution for parquet scan
 Key: ARROW-6086
 URL: https://issues.apache.org/jira/browse/ARROW-6086
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove








Re: New version(s) on JIRA

2019-07-31 Thread Antoine Pitrou


Ok, I've created it as well.

Regards

Antoine.


Le 31/07/2019 à 19:00, Wes McKinney a écrit :
> Yes, I think we need 0.15.0 for this
> 
> On Wed, Jul 31, 2019 at 10:42 AM Antoine Pitrou  wrote:
>>
>>
>> Thanks.  I created "2.0.0".
>> Will we also need a "0.15.0" for the flatbuffers alignment fix?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> Le 31/07/2019 à 03:00, Sutou Kouhei a écrit :
>>> Hi,
>>>
>>> I think that "2.0.0" is better. Because we'll not release
>>> "1.1.0".
>>>
>>> See also: 
>>> https://lists.apache.org/thread.html/d0ab931b15e75f745f8ae5a348f6c26a3e1f0bb98dc38a9a2c9888d3@%3Cdev.arrow.apache.org%3E
>>>
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>> In 
>>>   "New version(s) on JIRA" on Tue, 30 Jul 2019 19:20:00 +0200,
>>>   Antoine Pitrou  wrote:
>>>

 Hi,

 Should we create a "1.1.0" (and/or "2.0.0") version on JIRA, to start
 assigning non-urgent issues?

 Regards

 Antoine.



Re: Building on Arrow CUDA

2019-07-31 Thread Uwe L. Korn
Hello Paul,

you might want to look into
https://github.com/conda-forge/conda-forge.github.io/issues/687 where CUDA
support on conda-forge is discussed. I'm not up to date on this anymore, but
reading the whole issue should give you the current level of support. Once this
is solved, adding CUDA support to the Arrow packages on conda-forge should be
really simple (but this issue is the major hurdle).

Cheers
Uwe

On Thu, Jul 25, 2019, at 3:54 PM, Wes McKinney wrote:
> hi Paul,
> 
> On Wed, Jul 24, 2019 at 3:07 PM Paul Taylor  wrote:
> >
> > I'm looking at options to replace the custom Arrow logic in cuDF with
> > Arrow library calls. What's the recommended way to declare a dependency
> > on pyarrow/arrowcpp with CUDA support?
> >
> 
> Well, for conda or wheel packages, we are not shipping with the CUDA
> extensions enabled yet. So if you want to depend on one of those, you
> will have to change that. My understanding is that it's possible to
> build CUDA-enabled packages in conda-forge -- that would probably be
> your best bet. Does anyone know examples of such packages that are
> CUDA-enabled?
> 
> > I see in the docs it says to build from source, but that's only an
> > option for an (advanced) end-user. And building/vendoring
> > libarrow_cuda.so isn't a great option for a non-Arrow library, because
> > someone who does source build Arrow-with-cuda will conflict with the
> > version we ship.
> >
> > Right now we're considering statically linking libarrow_cuda into
> > libcudf.so and vendoring Arrow's cuda cython alongside ours, but this
> > increases compile times/library size.
> >
> > Is there a package management solution (like pip/conda install
> > pyarrow[cuda]) that I'm missing? If not, should there be?
> >
> 
> You can submit pull requests to
> 
> * https://github.com/conda-forge/arrow-cpp-feedstock
> * https://github.com/conda-forge/pyarrow-feedstock
> 
> conda-forge itself can provide guidance at
> https://gitter.im/conda-forge/conda-forge.github.io
> 
> > Best,
> >
> > Paul
> >
>


Re: New version(s) on JIRA

2019-07-31 Thread Wes McKinney
Yes, I think we need 0.15.0 for this

On Wed, Jul 31, 2019 at 10:42 AM Antoine Pitrou  wrote:
>
>
> Thanks.  I created "2.0.0".
> Will we also need a "0.15.0" for the flatbuffers alignment fix?
>
> Regards
>
> Antoine.
>
>
> Le 31/07/2019 à 03:00, Sutou Kouhei a écrit :
> > Hi,
> >
> > I think that "2.0.0" is better. Because we'll not release
> > "1.1.0".
> >
> > See also: 
> > https://lists.apache.org/thread.html/d0ab931b15e75f745f8ae5a348f6c26a3e1f0bb98dc38a9a2c9888d3@%3Cdev.arrow.apache.org%3E
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "New version(s) on JIRA" on Tue, 30 Jul 2019 19:20:00 +0200,
> >   Antoine Pitrou  wrote:
> >
> >>
> >> Hi,
> >>
> >> Should we create a "1.1.0" (and/or "2.0.0") version on JIRA, to start
> >> assigning non-urgent issues?
> >>
> >> Regards
> >>
> >> Antoine.
> >>


Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

2019-07-31 Thread Brian Hulette
I'm a little confused about the proposal now. If the unknown dimension
doesn't have to be the same within a record batch, how would you be able to
deduce it with the approach you described (dividing the logical length of
the values array by the length of the record batch)?

On Wed, Jul 31, 2019 at 8:24 AM Wes McKinney  wrote:

> I agree this sounds like a good application for ExtensionType. At
> minimum, ExtensionType can be used to develop a working version of
> what you need to help guide further discussions.
>
> On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques
>  wrote:
> >
> > Hello,
> >
> > if each record has a different size, then I suggest to just use a
> > Struct> where Dim is a struct (or expand in the outer
> > struct). You can probably add your own logic with the recently
> > introduced ExtensionType [1].
> >
> > François
> > [1]
> https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h
> >
> > On Mon, Jul 29, 2019 at 3:15 PM Edward Loper 
> wrote:
> > >
> > > The intention is that each individual record could have a different
> size.
> > > This could be consistent within a given batch, but wouldn't need to be.
> > > For example, if I wanted to send a 3-channel image, but the image size
> may
> > > vary for each record, then I could use
> > > FixedSizeList[3]>[-1]>[-1].
> > >
> > > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette 
> wrote:
> > >
> > > > This isn't really relevant but I feel compelled to point it out - the
> > > > FixedSizeList type has actually been in the Arrow spec for a while,
> but it
> > > > was only implemented in JS and Java initially. It was implemented in
> C++
> > > > just a few months ago.
> > > >
> > >
> > > Thanks for the clarification -- I was going based on the blame history
> for
> > > Layout.rst, but I guess it just didn't get officially documented there
> > > until the c++ implementation was added.
> > >
> > > -Edward
> > >
> > >
> > > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper
> 
> > > > wrote:
> > > >
> > > > > The FixedSizeList type, which was added to Arrow a few months ago,
> is an
> > > > > array where each slot contains a fixed-size sequence of values.
> > > > > It is
> > > > > specified as FixedSizeList<T>[N], where T is a child type and N
> > > > > is a signed int32 that specifies the length of each list.
> > > > >
> > > > > This is useful for encoding fixed-size tensors.  E.g., if I have a
> > > > 100x8x10
> > > > > tensor, then I can encode it as
> > > > > FixedSizeList[10]>[8]>[100].
> > > > >
> > > > > But I'm also interested in encoding tensors where some dimension
> sizes
> > > > are
> > > > > not known in advance.  It seems to me that FixedSizeList could be
> > > > extended
> > > > > to support this fairly easily, by simply defining that N=-1 means
> "each
> > > > > array slot has the same length, but that length is not known in
> advance."
> > > > >  So e.g. we could encode a 100x?x10 tensor as
> > > > > FixedSizeList[10]>[-1]>[100].
> > > > >
> > > > > Since these N=-1 row-lengths are not encoded in the type, we need
> some
> > > > way
> > > > > to determine what they are.  Luckily, every Field in the schema
> has a
> > > > > corresponding FieldNode in the message; and those FieldNodes can
> be used
> > > > to
> > > > > deduce the row lengths.  In particular, the row length must be
> equal to
> > > > the
> > > > > length of the child node divided by the length of the
> FixedSizeList.
> > > > E.g.,
> > > > > if we have a FixedSizeList[-1] array with the values [[1,
> 2], [3,
> > > > 4],
> > > > > [5, 6]] then the message representation is:
> > > > >
> > > > > * Length: 3, Null count: 0
> > > > > * Null bitmap buffer: Not required
> > > > > * Values array (byte array):
> > > > > * Length: 6,  Null count: 0
> > > > > * Null bitmap buffer: Not required
> > > > > * Value buffer: [1, 2, 3, 4, 5, 6, ]
> > > > >
> > > > > So we can deduce that the row length is 6/3=2.
> > > > >
> > > > > It looks to me like it would be fairly easy to add support for
> this.
> > > > E.g.,
> > > > > in the FixedSizeListArray constructor in c++, if
> list_type()->list_size()
> > > > > is -1, then set list_size_ to values.length()/length.  There would
> be no
> > > > > changes to the schema.fbs/message.fbs files -- we would just be
> > > > assigning a
> > > > > meaning to something that's currently meaningless (having
> > > > > FixedSizeList.listSize=-1).
> > > > >
> > > > > If there's support for adding this to Arrow, then I could put
> together a
> > > > > PR.
> > > > >
> > > > > Thanks,
> > > > > -Edward
> > > > >
> > > > > P.S. Apologies if this gets posted twice -- I sent it out a couple
> days
> > > > ago
> > > > > right before subscribing to the mailing list; but I don't see it
> on the
> > > > > archives, presumably because I wasn't subscribed yet when I sent
> it out.
> > > > >
> > > >
>
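The row-length deduction in the quoted proposal (logical length of the child values array divided by the length of the FixedSizeList array) is just this arithmetic:

```shell
# Deduce the unknown row length for the quoted example: a
# FixedSizeList array of length 3 whose child values array has
# logical length 6.
batch_length=3
values_length=6
row_length=$((values_length / batch_length))
echo "deduced row length: ${row_length}"   # 6 / 3 = 2
```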


[jira] [Created] (ARROW-6085) Create traits for physical query plan

2019-07-31 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6085:
-

 Summary: Create traits for physical query plan
 Key: ARROW-6085
 URL: https://issues.apache.org/jira/browse/ARROW-6085
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.15.0








[jira] [Created] (ARROW-6084) [Python] Support LargeList

2019-07-31 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6084:
-

 Summary: [Python] Support LargeList
 Key: ARROW-6084
 URL: https://issues.apache.org/jira/browse/ARROW-6084
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Antoine Pitrou








Re: [DISCUSS] Release cadence and release vote conventions

2019-07-31 Thread Andy Grove
To what extent would it be possible to automate the release process via
CICD?

On Wed, Jul 31, 2019 at 9:19 AM Wes McKinney  wrote:

> I think one thing that would help would be improving the
> reproducibility of the source release process. The RM has to have
> their machine configured in a particular way for it to work.
>
> Before anyone says "Docker" it isn't an easy solution because the
> release scripts need to be able to create git commits (created by the
> Maven release plugin) and sign artifacts using the RM's GPG keys.
>
> On Sat, Jul 27, 2019 at 10:04 PM Micah Kornfield 
> wrote:
> >
> > I just wanted to bump this thread.  Kou and Krisztián as the last two
> > release managers is there any specific infrastructure that you think
> might
> > have helped?
> >
> > Thanks,
> > Micah
> >
> > On Wed, Jul 17, 2019 at 11:29 PM Micah Kornfield 
> > wrote:
> >
> > > I can help as well, but I'm not exactly sure where to start.  It seems
> like
> > > there are already some JIRAs opened [1]
> > > for improving the release?  Could someone more familiar with the
> process
> > > pick out the highest priority ones? Do more need to be opened?
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > >
> https://issues.apache.org/jira/browse/ARROW-2880?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(%22Developer%20Tools%22%2C%20Packaging)%20and%20summary%20~%20Release
> > >
> > > On Sat, Jul 13, 2019 at 7:17 AM Wes McKinney 
> wrote:
> > >
> > >> To be effective at improving the life of release managers, the nightly
> > >> release process really should use as close as possible to the same
> > >> scripts that the RM uses to produce the release. Otherwise we could
> > >> have a situation where the nightlies succeed but there is some problem
> > >> that either fails an RC or is unable to be produced at all.
> > >>
> > >> On Sat, Jul 13, 2019 at 9:12 AM Andy Grove 
> wrote:
> > >> >
> > >> > I would like to volunteer to help with Java and Rust release process
> > >> work,
> > >> > especially nightly releases.
> > >> >
> > >> > Although I'm not that familiar with the Java implementation of
> Arrow, I
> > >> > have been using Java and Maven for a very long time.
> > >> >
> > >> > Do we envisage a single nightly release process that releases all
> > >> languages
> > >> > simultaneously? Or do we want a separate process per language, with
> > >> different
> > >> > maintainers?
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Jul 10, 2019 at 8:18 AM Wes McKinney 
> > >> wrote:
> > >> >
> > >> > > On Sun, Jul 7, 2019 at 7:40 PM Sutou Kouhei 
> > >> wrote:
> > >> > > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > > in future releases we should
> > >> > > > > institute a minimum 24-hour "quiet period" after any community
> > >> > > > > feedback on a release candidate to allow issues to be examined
> > >> > > > > further.
> > >> > > >
> > >> > > > I agree with this. I'll do so when I serve as release
> > >> > > > manager in the future.
> > >> > > >
> > >> > > > > To be able to release more often, two things have to happen:
> > >> > > > >
> > >> > > > > * More PMC members must engage with the release management
> role,
> > >> > > > > process, and tools
> > >> > > > > * Continued improvements to release tooling to make the
> process
> > >> less
> > >> > > > > painful for the release manager. For example, it seems we may
> > >> want to
> > >> > > > > find a different place than Bintray to host binary artifacts
> > >> > > > > temporarily during release votes
> > >> > > >
> > >> > > > My opinion is that we need to build a nightly release system.
> > >> > > >
> > >> > > > It uses dev/release/NN-*.sh to build .tar.gz and binary
> > >> > > > artifacts from the .tar.gz.
> > >> > > > It also uses dev/release/verify-release-candidate.* to
> > >> > > > verify the built .tar.gz and binary artifacts.
> > >> > > > It also uses dev/release/post-NN-*.sh to do post release
> > >> > > > tasks. (Some tasks such as uploading a package to packaging
> > >> > > > system will be dry-run.)
> > >> > > >
> > >> > >
> > >> > > I agree that having a turn-key release system that's capable of
> > >> > > producing nightly packages is the way to go. That way any problems
> > >> > > that would block a release will come up as they happen rather than
> > >> > > piling up until the very end like they are now.
> > >> > >
> > >> > > > I needed 10 or more changes for dev/release/ to create
> > >> > > > 0.14.0 RC0. (Some of them are still in my local stashes. I
> > >> > > > don't have time to create pull requests for them
> > >> > > > yet. Because I postponed some tasks of my main
> > >> > > > business. I'll create pull requests after I finish the
> > >> > > > postponed tasks of my main business.)
> > >> > > >
> > >> > >
> > >> > > Thanks. I'll follow up on the 0.14.1/0.15.0 thread -- since we
> need to
> > >> > > release again soon because of problems with 0.14.0 please let us
> know
> > >> > > what patches will be needed to make another release.

Re: [VOTE] Adopt FORMAT and LIBRARY SemVer-based version schemes for Arrow 1.0.0 and beyond

2019-07-31 Thread Uwe L. Korn
+1 from me.

I really like the separate versions

Uwe

On Tue, Jul 30, 2019, at 2:21 PM, Antoine Pitrou wrote:
> 
> +1 from me.
> 
> Regards
> 
> Antoine.
> 
> 
> 
> On Fri, 26 Jul 2019 14:33:30 -0500
> Wes McKinney  wrote:
> > hello,
> > 
> > As discussed on the mailing list thread [1], Micah Kornfield has
> > proposed a version scheme for the project to take effect starting with
> > the 1.0.0 release. See document [2] containing a discussion of the
> > issues involved.
> > 
> > To summarize my understanding of the plan:
> > 
> > 1. TWO VERSIONS: As of 1.0.0, we establish separate FORMAT and LIBRARY
> > versions. Currently there is only a single version number.
> > 
> > 2. SEMANTIC VERSIONING: We follow https://semver.org/ with regards to
> > communicating library API changes. Given the project's pace of
> > evolution, most releases are likely to be MAJOR releases according to
> > SemVer principles.
> > 
> > 3. RELEASES: Releases of the project will be named according to the
> > LIBRARY version. A major release may or may not change the FORMAT
> > version. When a LIBRARY version has been released for a new FORMAT
> > version, the latter is considered to be released and official.
> > 
> > 4. Each LIBRARY version will have a corresponding FORMAT version. For
> > example, LIBRARY versions 2.0.0 and 3.0.0 may track FORMAT version
> > 1.0.0. The idea is that FORMAT version will change less often than
> > LIBRARY version.
> > 
> > 5. BACKWARD COMPATIBILITY GUARANTEE: A newer versioned client library
> > will be able to read any data and metadata produced by an older client
> > library.
> > 
> > 6. FORWARD COMPATIBILITY GUARANTEE: An older client library must be
> > able to either read data generated from a new client library or detect
> > that it cannot properly read the data.
> > 
> > 7. FORMAT MINOR VERSIONS: An increase in the minor version of the
> > FORMAT version, such as 1.0.0 to 1.1.0, indicates that 1.1.0 contains
> > new features not available in 1.0.0. So long as these features are not
> > used (such as a new logical data type), forward compatibility is
> > preserved.
> > 
> > 8. FORMAT MAJOR VERSIONS: A change in the FORMAT major version
> > indicates a disruption to these compatibility guarantees in some way.
> > Hopefully we don't have to do this many times in our respective
> > lifetimes.
> > 
> > If I've misrepresented some aspect of the proposal it's fine to
> > discuss more and we can start a new vote.
> > 
> > Please vote to approve this proposal. I'd like to keep this vote open
> > for 7 days (until Friday August 2) to allow for ample opportunities
> > for the community to have a look.
> > 
> > [ ] +1 Adopt these version conventions and compatibility guarantees as
> > of Apache Arrow 1.0.0
> > [ ] +0
> > [ ] -1 I disagree because...
> > 
> > Here is my vote: +1
> > 
> > Thanks
> > Wes
> > 
> > [1]: 
> > https://lists.apache.org/thread.html/5715a4d402c835d22d929a8069c5c0cf232077a660ee98639d544af8@%3Cdev.arrow.apache.org%3E
> > [2]: 
> > https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#
> > 
> 
> 
> 
>
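The compatibility guarantees in points 5-7 of the proposal can be modeled as a
small predicate. This is an illustrative sketch only: the `(major, minor)`
tuples and the `can_read` helper are assumptions for illustration, not an
Arrow API.

```python
# Hypothetical model of the proposed FORMAT compatibility guarantees.
# Versions are (major, minor) tuples; this is not part of any Arrow library.
def can_read(reader_format, data_format):
    # Backward compatibility (point 5): a newer reader handles any older data.
    if reader_format >= data_format:
        return True
    # Forward compatibility (points 6-7): within the same major version an
    # older reader can handle newer-minor data, provided the new features
    # are not actually used; a newer major version signals a disruption.
    return reader_format[0] == data_format[0]
```

Under this model a 1.0 reader can still open 1.1 data (so long as no
1.1-only features appear), but not 2.0 data.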


Re: New version(s) on JIRA

2019-07-31 Thread Antoine Pitrou


Thanks.  I created "2.0.0".
Will we also need a "0.15.0" for the flatbuffers alignment fix?

Regards

Antoine.


On 31/07/2019 at 03:00, Sutou Kouhei wrote:
> Hi,
> 
> I think that "2.0.0" is better, because we won't release
> "1.1.0".
> 
> See also: 
> https://lists.apache.org/thread.html/d0ab931b15e75f745f8ae5a348f6c26a3e1f0bb98dc38a9a2c9888d3@%3Cdev.arrow.apache.org%3E
> 
> 
> Thanks,
> --
> kou
> 
> In 
>   "New version(s) on JIRA" on Tue, 30 Jul 2019 19:20:00 +0200,
>   Antoine Pitrou  wrote:
> 
>>
>> Hi,
>>
>> Should we create a "1.1.0" (and/or "2.0.0") version on JIRA, to start
>> assigning non-urgent issues?
>>
>> Regards
>>
>> Antoine.
>>


Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

2019-07-31 Thread Wes McKinney
I agree this sounds like a good application for ExtensionType. At
minimum, ExtensionType can be used to develop a working version of
what you need to help guide further discussions.

On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques
 wrote:
>
> Hello,
>
> if each record has a different size, then I suggest to just use a
> Struct<Dim, List<T>> where Dim is a struct (or expand in the outer
> struct). You can probably add your own logic with the recently
> introduced ExtensionType [1].
>
> François
> [1] 
> https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h
>
> On Mon, Jul 29, 2019 at 3:15 PM Edward Loper  
> wrote:
> >
> > The intention is that each individual record could have a different size.
> > This could be consistent within a given batch, but wouldn't need to be.
> > For example, if I wanted to send a 3-channel image, but the image size may
> > vary for each record, then I could use
> > FixedSizeList<FixedSizeList<FixedSizeList<T>[3]>[-1]>[-1].
> >
> > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette  wrote:
> >
> > > This isn't really relevant but I feel compelled to point it out - the
> > > FixedSizeList type has actually been in the Arrow spec for a while, but it
> > > was only implemented in JS and Java initially. It was implemented in C++
> > > just a few months ago.
> > >
> >
> > Thanks for the clarification -- I was going based on the blame history for
> > Layout.rst, but I guess it just didn't get officially documented there
> > until the c++ implementation was added.
> >
> > -Edward
> >
> >
> > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper 
> > > wrote:
> > >
> > > > The FixedSizeList type, which was added to Arrow a few months ago, is an
> > > > array where each slot contains a fixed-size sequence of values.  It is
> > > > specified as FixedSizeList<T>[N], where T is a child type and N is a
> > > signed
> > > > int32 that specifies the length of each list.
> > > >
> > > > This is useful for encoding fixed-size tensors.  E.g., if I have a
> > > 100x8x10
> > > > tensor, then I can encode it as
> > > > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[8]>[100].
> > > >
> > > > But I'm also interested in encoding tensors where some dimension sizes
> > > are
> > > > not known in advance.  It seems to me that FixedSizeList could be
> > > extended
> > > > to support this fairly easily, by simply defining that N=-1 means "each
> > > > array slot has the same length, but that length is not known in 
> > > > advance."
> > > >  So e.g. we could encode a 100x?x10 tensor as
> > > > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[-1]>[100].
> > > >
> > > > Since these N=-1 row-lengths are not encoded in the type, we need some
> > > way
> > > > to determine what they are.  Luckily, every Field in the schema has a
> > > > corresponding FieldNode in the message; and those FieldNodes can be used
> > > to
> > > > deduce the row lengths.  In particular, the row length must be equal to
> > > the
> > > > length of the child node divided by the length of the FixedSizeList.
> > > E.g.,
> > > > if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3,
> > > 4],
> > > > [5, 6]] then the message representation is:
> > > >
> > > > * Length: 3, Null count: 0
> > > > * Null bitmap buffer: Not required
> > > > * Values array (byte array):
> > > > * Length: 6,  Null count: 0
> > > > * Null bitmap buffer: Not required
> > > > * Value buffer: [1, 2, 3, 4, 5, 6, ]
> > > >
> > > > So we can deduce that the row length is 6/3=2.
> > > >
> > > > It looks to me like it would be fairly easy to add support for this.
> > > E.g.,
> > > > in the FixedSizeListArray constructor in c++, if 
> > > > list_type()->list_size()
> > > > is -1, then set list_size_ to values.length()/length.  There would be no
> > > > changes to the schema.fbs/message.fbs files -- we would just be
> > > assigning a
> > > > meaning to something that's currently meaningless (having
> > > > FixedSizeList.listSize=-1).
> > > >
> > > > If there's support for adding this to Arrow, then I could put together a
> > > > PR.
> > > >
> > > > Thanks,
> > > > -Edward
> > > >
> > > > P.S. Apologies if this gets posted twice -- I sent it out a couple days
> > > ago
> > > > right before subscribing to the mailing list; but I don't see it on the
> > > > archives, presumably because I wasn't subscribed yet when I sent it out.
> > > >
> > >
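The row-length deduction Edward describes (child FieldNode length divided by
parent length) is easy to sketch. The stand-alone function below is
illustrative only, not actual Arrow code.

```python
# Deduce the implied row length of a FixedSizeList whose listSize is -1,
# from the FieldNode lengths in the RecordBatch message. Illustrative only.
def deduce_row_length(parent_length, child_length):
    if parent_length == 0:
        raise ValueError("cannot deduce a row length from an empty array")
    if child_length % parent_length != 0:
        raise ValueError("child length is not a multiple of the array length")
    return child_length // parent_length

# The example above: a 3-slot array over 6 child values gives rows of length 2.
print(deduce_row_length(3, 6))
```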


Re: [DISCUSS] Release cadence and release vote conventions

2019-07-31 Thread Wes McKinney
I think one thing that would help would be improving the
reproducibility of the source release process. The RM has to have
their machine configured in a particular way for it to work.

Before anyone says "Docker" it isn't an easy solution because the
release scripts need to be able to create git commits (created by the
Maven release plugin) and sign artifacts using the RM's GPG keys.

On Sat, Jul 27, 2019 at 10:04 PM Micah Kornfield  wrote:
>
> I just wanted to bump this thread.  Kou and Krisztián as the last two
> release managers is there any specific infrastructure that you think might
> have helped?
>
> Thanks,
> Micah
>
> On Wed, Jul 17, 2019 at 11:29 PM Micah Kornfield 
> wrote:
>
> > I can help as well, but I'm not exactly sure where to start.  It seems like
> > there are already some JIRAs opened [1]
> > for improving the release?  Could someone more familiar with the process
> > pick out the highest priority ones? Do more need to be opened?
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://issues.apache.org/jira/browse/ARROW-2880?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(%22Developer%20Tools%22%2C%20Packaging)%20and%20summary%20~%20Release
> >
> > On Sat, Jul 13, 2019 at 7:17 AM Wes McKinney  wrote:
> >
> >> To be effective at improving the life of release managers, the nightly
> >> release process really should use as close as possible to the same
> >> scripts that the RM uses to produce the release. Otherwise we could
> >> have a situation where the nightlies succeed but there is some problem
> >> that either fails an RC or is unable to be produced at all.
> >>
> >> On Sat, Jul 13, 2019 at 9:12 AM Andy Grove  wrote:
> >> >
> >> > I would like to volunteer to help with Java and Rust release process
> >> work,
> >> > especially nightly releases.
> >> >
> >> > Although I'm not that familiar with the Java implementation of Arrow, I
> >> > have been using Java and Maven for a very long time.
> >> >
> >> > Do we envisage a single nightly release process that releases all
> >> languages
> >> > simultaneously? Or do we want a separate process per language, with
> >> different
> >> > maintainers?
> >> >
> >> >
> >> >
> >> > On Wed, Jul 10, 2019 at 8:18 AM Wes McKinney 
> >> wrote:
> >> >
> >> > > On Sun, Jul 7, 2019 at 7:40 PM Sutou Kouhei 
> >> wrote:
> >> > > >
> >> > > > Hi,
> >> > > >
> >> > > > > in future releases we should
> >> > > > > institute a minimum 24-hour "quiet period" after any community
> >> > > > > feedback on a release candidate to allow issues to be examined
> >> > > > > further.
> >> > > >
> >> > > > I agree with this. I'll do so when I serve as release
> >> > > > manager in the future.
> >> > > >
> >> > > > > To be able to release more often, two things have to happen:
> >> > > > >
> >> > > > > * More PMC members must engage with the release management role,
> >> > > > > process, and tools
> >> > > > > * Continued improvements to release tooling to make the process
> >> less
> >> > > > > painful for the release manager. For example, it seems we may
> >> want to
> >> > > > > find a different place than Bintray to host binary artifacts
> >> > > > > temporarily during release votes
> >> > > >
> >> > > > My opinion is that we need to build a nightly release system.
> >> > > >
> >> > > > It uses dev/release/NN-*.sh to build .tar.gz and binary
> >> > > > artifacts from the .tar.gz.
> >> > > > It also uses dev/release/verify-release-candidate.* to
> >> > > > verify the built .tar.gz and binary artifacts.
> >> > > > It also uses dev/release/post-NN-*.sh to do post release
> >> > > > tasks. (Some tasks such as uploading a package to packaging
> >> > > > system will be dry-run.)
> >> > > >
> >> > >
> >> > > I agree that having a turn-key release system that's capable of
> >> > > producing nightly packages is the way to go. That way any problems
> >> > > that would block a release will come up as they happen rather than
> >> > > piling up until the very end like they are now.
> >> > >
> >> > > > I needed 10 or more changes for dev/release/ to create
> >> > > > 0.14.0 RC0. (Some of them are still in my local stashes. I
> >> > > > don't have time to create pull requests for them
> >> > > > yet. Because I postponed some tasks of my main
> >> > > > business. I'll create pull requests after I finish the
> >> > > > postponed tasks of my main business.)
> >> > > >
> >> > >
> >> > > Thanks. I'll follow up on the 0.14.1/0.15.0 thread -- since we need to
> >> > > release again soon because of problems with 0.14.0 please let us know
> >> > > what patches will be needed to make another release.
> >> > >
> >> > > > If we fix problems related to dev/release/ in our normal
> >> > > > development process, release process will be less painful.
> >> > > >
> >> > > > The biggest problem for 0.14.0 RC0 is java/pom.xml related:
> >> > > >   https://github.com/apache/arrow/pull/4717
> >> > > >
> >> > > > It was difficult for me because I don't have Jav

[jira] [Created] (ARROW-6083) [Java] Refactor Jdbc adapter consume logic

2019-07-31 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6083:
-

 Summary: [Java] Refactor Jdbc adapter consume logic
 Key: ARROW-6083
 URL: https://issues.apache.org/jira/browse/ARROW-6083
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


The Jdbc adapter's read from {{ResultSet}} looks like:

while (rs.next()) {
  for (int i = 1; i <= columnCount; i++) {
    jdbcToFieldVector(
        rs,
        i,
        rs.getMetaData().getColumnType(i),
        rowCount,
        root.getVector(rsmd.getColumnName(i)),
        config);
  }
  rowCount++;
}

And {{jdbcToFieldVector}} has lots of switch-cases; that is to say, for every
single value from the ResultSet we have to evaluate lots of conditions.

I think we could optimize this using consumers/delegates like the Avro adapter.
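The consumer/delegate idea is to resolve the type dispatch once per column
rather than once per value. A language-agnostic sketch (Python for brevity;
the names are illustrative, not the actual JDBC adapter API):

```python
# Resolve each column's converter once, then reuse it for every row,
# avoiding a per-value switch on the column type. Illustrative names only.
def make_consumer(column_type):
    if column_type == "INTEGER":
        return lambda value, out: out.append(int(value))
    if column_type == "VARCHAR":
        return lambda value, out: out.append(str(value))
    raise NotImplementedError(column_type)

def consume_rows(rows, column_types):
    consumers = [make_consumer(t) for t in column_types]  # dispatch once
    outputs = [[] for _ in column_types]
    for row in rows:
        for consumer, value, out in zip(consumers, row, outputs):
            consumer(value, out)
    return outputs
```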



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6082) [Python] create pa.dictionary() type with non-integer indices type crashes

2019-07-31 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6082:


 Summary: [Python] create pa.dictionary() type with non-integer 
indices type crashes
 Key: ARROW-6082
 URL: https://issues.apache.org/jira/browse/ARROW-6082
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


For example, if you mix the order of the indices and values types:

{code}
In [1]: pa.dictionary(pa.int8(), pa.string())   

   
Out[1]: DictionaryType(dictionary)

In [2]: pa.dictionary(pa.string(), pa.int8())   

   
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0731 14:40:42.748589 26310 type.cc:440]  Check failed: 
is_integer(index_type->id()) dictionary index type should be signed integer
*** Check failure stack trace: ***
Aborted (core dumped)
{code}
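Presumably the fix is to validate the index type up front and raise a
Python-level error instead of hitting the DCHECK. A pure-Python sketch of such
a guard (the string type names and the `dictionary_type` helper are simplified
stand-ins, not the pyarrow API):

```python
# Guard sketch: reject non-signed-integer index types with a TypeError
# instead of aborting. Type names are simplified placeholders.
SIGNED_INTEGER_TYPES = {"int8", "int16", "int32", "int64"}

def dictionary_type(index_type, value_type):
    if index_type not in SIGNED_INTEGER_TYPES:
        raise TypeError(
            "dictionary index type should be a signed integer, got %r"
            % (index_type,))
    return ("dictionary", index_type, value_type)
```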



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6081) FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmptb2ao6te_job_6e0a8ca1.parquet'

2019-07-31 Thread David Draper (JIRA)
David Draper created ARROW-6081:
---

 Summary: FileNotFoundError: [Errno 2] No such file or directory: 
'/tmp/tmptb2ao6te_job_6e0a8ca1.parquet'
 Key: ARROW-6081
 URL: https://issues.apache.org/jira/browse/ARROW-6081
 Project: Apache Arrow
  Issue Type: Bug
Reporter: David Draper


Any idea on how to fix this error? 

 

Traceback (most recent call last):
 File "/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/client.py", 
line 1530, in load_table_from_dataframe
 dataframe.to_parquet(tmppath)
 File "/usr/local/lib64/python3.6/site-packages/pandas/core/frame.py", line 
2203, in to_parquet
 partition_cols=partition_cols, **kwargs)
 File "/usr/local/lib64/python3.6/site-packages/pandas/io/parquet.py", line 
252, in to_parquet
 partition_cols=partition_cols, **kwargs)
 File "/usr/local/lib64/python3.6/site-packages/pandas/io/parquet.py", line 
122, in write
 coerce_timestamps=coerce_timestamps, **kwargs)
 File "/usr/local/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1270, 
in write_table
 writer.write_table(table, row_group_size=row_group_size)
 File "/usr/local/lib64/python3.6/site-packages/pyarrow/parquet.py", line 426, 
in write_table
 self.writer.write_table(table, row_group_size=row_group_size)
 File "pyarrow/_parquet.pyx", line 1311, in 
pyarrow._parquet.ParquetWriter.write_table
 File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/var/cache/tomcat/temp/interpreter-2169813765840716657.tmp", line 84, in <module>
 client.load_table_from_dataframe(appended_data, 
table_ref,job_config=job_config).result()
 File "/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/client.py", 
line 1546, in load_table_from_dataframe
 os.remove(tmppath)
FileNotFoundError: [Errno 2] No such file or directory: 
'/tmp/tmptb2ao6te_job_6e0a8ca1.parquet'



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6080) [Java] Support search operation for BaseRepeatedValueVector

2019-07-31 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6080:
---

 Summary: [Java] Support search operation for 
BaseRepeatedValueVector
 Key: ARROW-6080
 URL: https://issues.apache.org/jira/browse/ARROW-6080
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6079) [Java] Implement/test UnionFixedSizeListWriter for FixedSizeListVector

2019-07-31 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6079:
-

 Summary: [Java] Implement/test UnionFixedSizeListWriter for 
FixedSizeListVector
 Key: ARROW-6079
 URL: https://issues.apache.org/jira/browse/ARROW-6079
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Now we have two list vectors: {{ListVector}} and {{FixedSizeListVector}}.

{{ListVector}} already has a UnionListWriter for writing data; however,
{{FixedSizeListVector}} doesn't have one yet, and it seems the only way
for users to write data is to get the inner vector and set values manually.

Implementing a writer for {{FixedSizeListVector}} would be useful in some cases.

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)