Re: [VOTE][Julia] Release Apache Arrow Julia 2.7.2 RC1

2024-05-06 Thread Jacob Quinn
+1, tested on m3 macos

On Mon, May 6, 2024 at 4:11 PM Sutou Kouhei  wrote:

> Hi,
>
> Note that we already published this version to the official
> registry of general Julia packages[1] accidentally[2] but I
> would like to start a vote for this version to satisfy the
> ASF's release policy[3].
>
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.7.2.
>
> This release candidate is based on commit:
> 64fc730f767de84835a5f1b4fc9b7831a3c2d15b [4]
>
> The source release rc1 is hosted at [5].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [6] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.7.2
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.7.2 because...
>
> [1]: https://github.com/JuliaRegistries/General/pull/106211
> [2]:
> https://github.com/apache/arrow-julia/commit/64fc730f767de84835a5f1b4fc9b7831a3c2d15b#commitcomment-141695334
> [3]: https://www.apache.org/legal/release-policy.html
> [4]:
> https://github.com/apache/arrow-julia/tree/64fc730f767de84835a5f1b4fc9b7831a3c2d15b
> [5]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.7.2-rc1/
> [6]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>
>
> Thanks,
> --
> kou
>
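
A note on the "verify checksums" step in the quoted instructions: the real verification is done by the project's `dev/release/verify_rc.sh` script, which also checks signatures. The Python below is only a minimal sketch of what a checksum comparison amounts to. It assumes a `sha512sum`-style checksum file (`<hex digest>  <file name>`); the helper names are hypothetical and it does not cover the GPG signature check.

```python
import hashlib
from pathlib import Path


def sha512_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-512 digest of a file as lowercase hex."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        # Read in chunks so large release tarballs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare an artifact against a '<hex digest>  <file name>' checksum file.

    Assumption: the digest is the first whitespace-separated token in the file.
    """
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_hex(artifact) == expected
```

A mismatch here would be a reason to vote -1 rather than something to work around.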


Re: [VOTE][Julia] Release Apache Arrow Julia 2.7.1 RC1

2024-01-31 Thread Jacob Quinn
+1, tested on macos.

-Jacob

On Wed, Jan 31, 2024 at 10:11 AM Ben Baumgold  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.7.1.
>
> This release candidate is based on commit:
> ac199b0e377502ea0f1fa5ced7fda897a01b82a9 [1]
>
> The source release rc1 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.7.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.7.1 because...
>
> [1]:
>
> https://github.com/apache/arrow-julia/tree/ac199b0e377502ea0f1fa5ced7fda897a01b82a9
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.7.1-rc1/
> [3]:
>
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.7.0 RC1

2023-12-08 Thread Jacob Quinn
+1

Tested on macos m3 with Julia 1.10-rc2

-Jacob

On Fri, Dec 8, 2023 at 7:08 PM Dewey Dunnington
 wrote:

> +1
>
> I ran
>
> export PATH="/Applications/Julia-1.9.app/Contents/Resources/julia/bin:$PATH"
> dev/release/verify_rc.sh 2.7.0 1
>
> ...on MacOS M1 Ventura
>
> On Tue, Dec 5, 2023 at 4:38 PM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > I would like to propose the following release candidate (RC1) of
> > Apache Arrow Julia version 2.7.0.
> >
> > This release candidate is based on commit:
> > 37122911c24f44318e6d4a0840408adb3364cf2a [1]
> >
> > The source release rc1 is hosted at [2].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [3] for how to validate a release candidate.
> >
> > The vote will be open for at least 24 hours.
> >
> > [ ] +1 Release this as Apache Arrow Julia 2.7.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Julia 2.7.0 because...
> >
> > [1]:
> https://github.com/apache/arrow-julia/tree/37122911c24f44318e6d4a0840408adb3364cf2a
> > [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.7.0-rc1/
> > [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
> >
> >
> > Thanks,
> > --
> > kou
>


Re: [Vote][Format] C Data Interface Format string for REE

2023-08-16 Thread Jacob Quinn
+1 (binding)

Cheers,

-Jacob

On Wed, Aug 16, 2023 at 8:16 AM Matt Topol 
wrote:

> Hey All,
>
> As proposed by Felipe [1] I'm starting a vote on the proposed update to the
> Format Spec of adding "+r" as the format string for passing Run-End Encoded
> arrays through the Arrow C Data Interface.
>
> A PR containing an update to the C++ Arrow implementation to add support
> for this format string along with documentation updates can be found here
> [2].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 - I'm in favor of this new C Data Format string
> [ ] +0
> [ ] -1 - I'm against adding this new format string because
>
> Thanks everyone!
>
> --Matt
>
> [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781
> [2]: https://github.com/apache/arrow/pull/37174
>


Re: [ANNOUNCE] New Arrow PMC member: Ben Baumgold,

2023-06-20 Thread Jacob Quinn
Yay! Congrats Ben! Love to see more Julia folks here!

-Jacob

On Tue, Jun 20, 2023 at 4:15 AM Andrew Lamb  wrote:

> The Project Management Committee (PMC) for Apache Arrow has invited
> Ben Baumgold to become a PMC member and we are pleased to announce
> that Ben Baumgold has accepted.
>
> Congratulations and welcome!
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.6.2 RC1

2023-06-09 Thread Jacob Quinn
+1 (macOS m1)

-Jacob

On Fri, Jun 9, 2023 at 1:41 PM Sutou Kouhei  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.6.2.
>
> This release candidate is based on commit:
> 9f1d51a2c975bd83cbaf70c5f640762c6a0bccaf [1]
>
> The source release rc1 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.6.2
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.6.2 because...
>
> [1]:
> https://github.com/apache/arrow-julia/tree/9f1d51a2c975bd83cbaf70c5f640762c6a0bccaf
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.6.2-rc1/
> [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.6.1 RC1

2023-06-06 Thread Jacob Quinn
+1 (macOS M1)

Cheers,

-Jacob

On Tue, Jun 6, 2023 at 7:48 PM Sutou Kouhei  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.6.1.
>
> This release candidate is based on commit:
> 2d1114e180ef11f9d3bbe310b2eb856550cfbeb3 [1]
>
> The source release rc1 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.6.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.6.1 because...
>
> [1]:
> https://github.com/apache/arrow-julia/tree/2d1114e180ef11f9d3bbe310b2eb856550cfbeb3
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.6.1-rc1/
> [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.6.0 RC1

2023-06-04 Thread Jacob Quinn
+1

Tested on apple m1

-Jacob


On Sat, Jun 3, 2023 at 3:27 PM Sutou Kouhei  wrote:

> +1
>
> I ran the following command line on Debian GNU/Linux sid:
>
>   VERIFY_FORCE_USE_JULIA_BINARY=1 dev/release/verify_rc.sh 2.6.0 1
>
>
> Thanks,
> --
> kou
>
> In <20230604.072246.1693870468835902730@clear-code.com>
>   "[VOTE][Julia] Release Apache Arrow Julia 2.6.0 RC1" on Sun, 04 Jun 2023
> 07:22:46 +0900 (JST),
>   Sutou Kouhei  wrote:
>
> > Hi,
> >
> > I would like to propose the following release candidate (RC1) of
> > Apache Arrow Julia version 2.6.0.
> >
> > This release candidate is based on commit:
> > 771db0a31685e6b12e8b576685b7d8d5c573b855 [1]
> >
> > The source release rc1 is hosted at [2].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [3] for how to validate a release candidate.
> >
> > The vote will be open for at least 24 hours.
> >
> > [ ] +1 Release this as Apache Arrow Julia 2.6.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Julia 2.6.0 because...
> >
> > [1]:
> https://github.com/apache/arrow-julia/tree/771db0a31685e6b12e8b576685b7d8d5c573b855
> > [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.6.0-rc1/
> > [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.5.2 RC1

2023-04-19 Thread Jacob Quinn
+1 (macOS M1)

-Jacob

On Tue, Apr 18, 2023 at 1:59 AM Sutou Kouhei  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.5.2.
>
> This release candidate is based on commit:
> 686ab570b831035715cb58f666233ec673e50d8f [1]
>
> The source release rc1 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.5.2
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.5.2 because...
>
> [1]:
> https://github.com/apache/arrow-julia/tree/686ab570b831035715cb58f666233ec673e50d8f
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.5.2-rc1/
> [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.5.1 RC2

2023-04-15 Thread Jacob Quinn
Verified on macos m1

+1

-Jacob

On Sat, Apr 15, 2023 at 7:19 AM Sutou Kouhei  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC2) of
> Apache Arrow Julia version 2.5.1.
>
> This release candidate is based on commit:
> e6c44ddbe0fb0c336fad31aa5a84f0b167495d31 [1]
>
> The source release rc2 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.5.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.5.1 because...
>
> [1]:
> https://github.com/apache/arrow-julia/tree/e6c44ddbe0fb0c336fad31aa5a84f0b167495d31
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.5.1-rc2/
> [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.5.1 RC1

2023-04-11 Thread Jacob Quinn
Hmmm, I'm also on MacOS m1, but didn't have any issues running tests.

David, is the error reproducible? We fixed an issue for this in [this
commit](
https://github.com/apache/arrow-julia/commit/6d0ac4946f062414e2b60aa3d67c2875bb2e7958),
but it's possible that our check for this condition wasn't strong enough or
something. If it's reproducible, I'd appreciate being able to do a debug
build for you and have it report some data around our check for this.
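
For anyone following along, the condition in the quoted error can be checked by hand. This is a minimal Python sketch, not the Arrow.jl check itself; `is_aligned` is a hypothetical helper, and the address is the one reported in the trace below. The element type in the trace is an `Int128`-backed decimal, which is why the required alignment is 16 bytes.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """True if address is a multiple of alignment (alignment must be a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # For a power-of-two alignment, the low bits of an aligned address are all zero.
    return address & (alignment - 1) == 0


address = 0x293389438  # pointer value reported in the quoted error

print(is_aligned(address, 16))  # False: the last hex digit is 8, not 0
print(is_aligned(address, 8))   # True: the buffer is only 8-byte aligned
```

So the buffer in that report is 8-byte aligned but not 16-byte aligned, which is consistent with the error message.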

-Jacob

On Tue, Apr 11, 2023 at 6:36 PM David Li  wrote:

> I had an issue during verification (macOS/AArch64) [1]
>
> The gist seems to be:
>
> ```
>   nested task error: ArgumentError: unsafe_wrap: pointer 0x293389438
> is not properly aligned to 16 bytes
>   Stacktrace:
> [1] #unsafe_wrap#100
>   @ ./pointer.jl:92 [inlined]
> [2] unsafe_wrap
>   @ ./pointer.jl:90 [inlined]
> [3] reinterp(#unused#::Type{Arrow.Decimal{2, 2, Int128}},
> batch::Arrow.Batch, buf::Arrow.Flatbuf.Buffer,
> compression::Arrow.Flatbuf.BodyCompression)
>   @ Arrow
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:557
> [4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Decimal,
> batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64,
> Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
>   @ Arrow
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:685
> [5] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch,
> rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding},
> nodeidx::Int64, bufferidx::Int64, convert::Bool)
>   @ Arrow
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:498
> [6] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
>   @ Arrow
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:474
> [7] iterate
>   @
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:471
> [inlined]
> [8] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
>   @ Base ./abstractarray.jl:946
> [9] _collect
>   @ ./array.jl:713 [inlined]
>[10] collect
>   @ ./array.jl:707 [inlined]
>[11] macro expansion
>   @
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:376
> [inlined]
>[12] (::Arrow.var"#108#114"{Bool, Channel{Any},
> WorkerUtilities.OrderedSynchronizer, Dict{Int64, Arrow.DictEncoding},
> Arrow.Batch, Int64})()
>   @ Arrow ./threadingconstructs.jl:341
> ```
>
> I haven't gotten a chance to look more into it/try again.
>
> [1]: https://gist.github.com/lidavidm/b8f604b60c0a2cdfb04e96d4e58bdfdb
>
> On Wed, Apr 12, 2023, at 06:50, Sutou Kouhei wrote:
> > Hi,
> >
> > I would like to propose the following release candidate (RC1) of
> > Apache Arrow Julia version 2.5.1.
> >
> > This release candidate is based on commit:
> > 22088f1cb59bcd99fbffbf9d8248e491690dbfd9 [1]
> >
> > The source release rc1 is hosted at [2].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [3] for how to validate a release candidate.
> >
> > The vote will be open for at least 24 hours.
> >
> > [ ] +1 Release this as Apache Arrow Julia 2.5.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Julia 2.5.1 because...
> >
> > [1]:
> >
> https://github.com/apache/arrow-julia/tree/22088f1cb59bcd99fbffbf9d8248e491690dbfd9
> > [2]:
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.5.1-rc1/
> > [3]:
> >
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.5.0 RC1

2023-03-15 Thread Jacob Quinn
+1

Tested on MacOS m1

On Tue, Mar 14, 2023 at 11:56 PM Sutou Kouhei  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.5.0.
>
> This release candidate is based on commit:
> 4d71bee55249dae32983971362256798a9af38bf [1]
>
> The source release rc1 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.5.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.5.0 because...
>
> [1]:
> https://github.com/apache/arrow-julia/tree/4d71bee55249dae32983971362256798a9af38bf
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.5.0-rc1/
> [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>
>
> Thanks,
> --
> kou
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.4.3 RC1

2023-02-02 Thread Jacob Quinn
+1

Ran on macos m1.

-Jacob

On Thu, Feb 2, 2023 at 7:53 PM Sutou Kouhei  wrote:

> +1
>
> I ran the following command line on Debian GNU/Linux sid:
>
>   VERIFY_FORCE_USE_JULIA_BINARY=1 dev/release/verify_rc.sh 2.4.3 1
>
>
> Thanks,
> --
> kou
>
>
> In <20230203.113400.196149433832986@clear-code.com>
>   "[VOTE][Julia] Release Apache Arrow Julia 2.4.3 RC1" on Fri, 03 Feb 2023
> 11:34:00 +0900 (JST),
>   Sutou Kouhei  wrote:
>
> > Hi,
> >
> > I would like to propose the following release candidate (RC1) of
> > Apache Arrow Julia version 2.4.3.
> >
> > This release candidate is based on commit:
> > 8c0cc4498801758064bd72ffa2fa6460cfc51fdc [1]
> >
> > The source release rc1 is hosted at [2].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [3] for how to validate a release candidate.
> >
> > The vote will be open for at least 24 hours.
> >
> > [ ] +1 Release this as Apache Arrow Julia 2.4.3
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Julia 2.4.3 because...
> >
> > [1]:
> https://github.com/apache/arrow-julia/tree/8c0cc4498801758064bd72ffa2fa6460cfc51fdc
> > [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.4.3-rc1/
> > [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
> >
> >
> > Thanks,
> > --
> > kou
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.4.2 RC0

2023-01-14 Thread Jacob Quinn
+1 (binding)

Verified on MacOS m1.

-Jacob

On Fri, Jan 13, 2023 at 6:17 PM Sutou Kouhei  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC0) of
> Apache Arrow Julia version 2.4.2.
>
> This release candidate is based on commit:
> 5ba768918f8088c41e5f89ae890235354a887fd6 [1]
>
> The source release rc0 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.4.2
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.4.2 because...
>
> [1]:
> https://github.com/apache/arrow-julia/tree/5ba768918f8088c41e5f89ae890235354a887fd6
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.4.2-rc0/
> [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>
>
> Thanks,
> --
> kou
>


Re: Dictionary Key For Null Slot

2022-11-29 Thread Jacob Quinn
I was just looking into a related issue last night where it seems pandas
complains if there are _any_ nulls in the dictionary and we were
considering not allowing nulls in the dictionary values at all. But it's a
little tangled up at the moment because we've already allowed it. Ref:
https://github.com/apache/arrow-julia/issues/360
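
To make the two levels of nullability concrete, here is a minimal sketch in plain Python lists. It is not any implementation's actual layout code; `decode` and `keys_all_in_bounds` are hypothetical helpers, and the validity mask is modeled as a list of booleans.

```python
from typing import Optional, Sequence


def decode(keys: Sequence[int], validity: Sequence[bool],
           values: Sequence[Optional[str]]) -> list:
    """Decode a dictionary-encoded column; the key in a null slot is never read."""
    return [values[k] if valid else None for k, valid in zip(keys, validity)]


def keys_all_in_bounds(keys: Sequence[int], values: Sequence) -> bool:
    """The stricter rule proposed in the quoted message: every key, even one
    sitting in a null slot, must be a valid index into the values array."""
    return all(0 <= k < len(values) for k in keys)


values = ["a", "b", None]             # a null stored in the dictionary values
keys = [0, 2, 99, 1]                  # slot 2 holds an arbitrary key
validity = [True, True, False, True]  # slot 2 is null via the key validity mask

print(decode(keys, validity, values))    # ['a', None, None, 'b']
print(keys_all_in_bounds(keys, values))  # False, because of the 99 in the null slot
```

Slots 1 and 2 both decode to null, but by different routes: one through a null dictionary value, the other through the key validity mask.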

-Jacob

On Tue, Nov 29, 2022 at 8:06 AM Raphael Taylor-Davies
 wrote:

> Hi All,
>
> I am not sure if it is intentional, but a common property of all arrow
> layouts is that the value at a given index is defined, even if for a
> null it may contain an arbitrary value. This is true everywhere except
> for the dictionary layout, where the key in the null slot may contain an
> arbitrary value, and consequently the value of the index is undefined.
>
> This has been a repeated nuisance in the Rust implementation, but so far
> I've managed to find workarounds for most issues, however, I'm unsure
> how to handle StructArrays containing non-nullable, dictionary-encoded
> children. As the children are non-nullable, they cannot contain a null
> mask, but without a null mask the child dictionary array is ill-formed.
> I'm not really sure how best to handle this?
>
> One option might be to require that all dictionary keys, even those for
> null slots, are a valid index into the child values array. As the child
> values array can itself contain nulls, this is always possible.
>
> My questions are therefore:
>
> * How are other implementations handling this case?
>
> * Is requiring all dictionary keys to be a valid index into the child
> values acceptable? We already do something similar for offsets
>
> * What is the motivation for dictionaries having two levels of
> nullability, both in the keys and values. UnionArray by contrast only
> encodes nullability in its children
>
> Any help would be much appreciated
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.4.1 RC0

2022-11-16 Thread Jacob Quinn
+1

(tested on macos m1, Julia 1.8.2 and julia#master)

On Wed, Nov 16, 2022 at 4:22 PM Sutou Kouhei  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC0) of
> Apache Arrow Julia version 2.4.1.
>
> This release candidate is based on commit:
> 23258f12bb4b28eb3846d0d3a91a54e2628254d1 [1]
>
> The source release rc0 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.4.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.4.1 because...
>
> [1]:
> https://github.com/apache/arrow-julia/tree/23258f12bb4b28eb3846d0d3a91a54e2628254d1
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.4.1-rc0/
> [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>
> Thanks,
> --
> kou
>


Re: [VOTE] Move issue tracking to GitHub Issues

2022-10-26 Thread Jacob Quinn
+1

On Wed, Oct 26, 2022 at 5:04 PM Neal Richardson 
wrote:

> I propose that we move issue tracking from the ASF's Jira to GitHub Issues.
> This has been discussed on [1] and [2] and there seems to be consensus. A
> number of Arrow subprojects already use GitHub Issues; this moves the issue
> tracking for `apache/arrow` into GitHub along with the source code.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Leave ASF Jira and move to GitHub Issues
> [ ] +0
> [ ] -1 Remain in Jira because...
>
> My vote: +1
>
> Neal
>
>
> [1]: https://lists.apache.org/thread/l545m95xmf3w47oxwqxvg811or7b93tb
> [2]: https://lists.apache.org/thread/0vwj8gdo55jly5zn16wksrotyqqm0zqr
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.4.0 RC1

2022-10-26 Thread Jacob Quinn
+1 (woohoo, first official vote!)

On Tue, Oct 25, 2022 at 2:52 PM Sutou Kouhei  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.4.0.
>
> This release candidate is based on commit:
> 571a8fcf6866956d6a47390769e765d1ed0782c7 [1]
>
> The source release rc1 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.4.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.4.0 because...
>
> [1]:
> https://github.com/apache/arrow-julia/tree/571a8fcf6866956d6a47390769e765d1ed0782c7
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.4.0-rc1/
> [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>


Re: [RESULT][VOTE] Restart the Julia implementation with new repository and process

2021-12-06 Thread Jacob Quinn
Thanks kou,

I think I've got all the details summarized in [this issue](
https://github.com/JuliaData/Arrow.jl/issues/265) where we'll track
progress on the required CLAs. I've also reached out to the
individuals directly about signing the CLAs.
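
For reference, the filtering kou describes in the quoted message below (drop single-commit authors, then drop anyone covered by the previous IP clearance) can be reproduced from the two `git shortlog -sn` listings. This is just a sketch over the numbers pasted in that message; `parse_shortlog` is a hypothetical helper.

```python
def parse_shortlog(text: str) -> dict:
    """Parse `git shortlog -sn` output lines ('<count>  <name>') into {name: count}."""
    counts = {}
    for line in text.strip().splitlines():
        count, name = line.strip().split(None, 1)
        counts[name] = int(count)
    return counts


# Authors with commits after the previous IP clearance commit.
after = parse_shortlog("""
    53  Jacob Quinn
    16  Jarrett Revels
    10  Curtis Vogt
     8  Eric Hanson
     2  Douglas Bates
     2  ExpandingMan
     2  Kristoffer Carlsson
     1  Damien Drix
     1  Denis Barucic
     1  Jon Alm Eriksen
     1  KronosTheLate
     1  Nick Robinson
     1  Pietro Vertechi
     1  Simeon Schaub
     1  Tanmay Mohapatra
     1  Étienne Tétreault-Pinard
""")

# Authors already covered by the previous IP clearance.
before = parse_shortlog("""
   103  ExpandingMan
    53  Jacob Quinn
     4  David Anthoff
     2  Jacob Adenbaum
     2  Spencer Lyon
     1  Graham Stark
     1  John Myles White
     1  Michael Savastio
     1  TheCedarPrince
""")

# Keep authors with more than one commit who were not part of the earlier clearance.
targets = [name for name, n in after.items() if n > 1 and name not in before]
print(targets)
# ['Jarrett Revels', 'Curtis Vogt', 'Eric Hanson', 'Douglas Bates', 'Kristoffer Carlsson']
```

That matches the five-person target list in the quoted message.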

-Jacob

On Fri, Dec 3, 2021 at 4:02 PM Sutou Kouhei  wrote:

> Hi Jacob,
>
> Thanks for helping with this!
>
> I'm filling in the IP clearance form:
>
> https://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-julia-library2.xml
>
> (I think that we can see it at
> https://incubator.apache.org/ip-clearance/arrow-julia-library2.html
> soon.)
>
> Could you stop merging any pull requests until this IP
> clearance is finished? (I think that it's acceptable because
> the last commit of https://github.com/JuliaData/Arrow.jl is
> 2021-11-02.)
>
> Here are TODO items:
>
>   1. Check that all active committers have a signed CLA on record.
>
>   2. Remind active committers that they are responsible for
>  ensuring that a Corporate CLA is recorded if such is
>  required to authorize their contributions under their
>  individual CLA.
>
> I think that people who committed after
> https://incubator.apache.org/ip-clearance/arrow-julia-library.html
> (8583da8a84a9e355affb42654dcd8c765bcc3134) are the targets:
>
> $ git shortlog -sn
> 8583da8a84a9e355affb42654dcd8c765bcc3134..1447cb2b13b728729f9a89760ac07a848e31e599
> 53  Jacob Quinn
> 16  Jarrett Revels
> 10  Curtis Vogt
>  8  Eric Hanson
>  2  Douglas Bates
>  2  ExpandingMan
>  2  Kristoffer Carlsson
>  1  Damien Drix
>  1  Denis Barucic
>  1  Jon Alm Eriksen
>  1  KronosTheLate
>  1  Nick Robinson
>  1  Pietro Vertechi
>  1  Simeon Schaub
>  1  Tanmay Mohapatra
>  1  Étienne Tétreault-Pinard
>
> According to the discussion in Ballista's IP clearance process,
>   https://lists.apache.org/thread/k0k7x3rrg56nk8s2c1tvrrv76zl2b1m4
> we can remove people who have only 1 commit from the target list:
>
> 53  Jacob Quinn
> 16  Jarrett Revels
> 10  Curtis Vogt
>  8  Eric Hanson
>  2  Douglas Bates
>  2  ExpandingMan
>  2  Kristoffer Carlsson
>
> And we can also remove people who committed before
> (or at) 8583da8a84a9e355affb42654dcd8c765bcc3134 because
> they signed a CLA in the previous IP clearance:
>
> $ git shortlog -sn
> 16b729db74d78ecb010efab855c9e46c8052f59e..8583da8a84a9e355affb42654dcd8c765bcc3134
> | cat
>103  ExpandingMan
> 53  Jacob Quinn
>  4  David Anthoff
>  2  Jacob Adenbaum
>  2  Spencer Lyon
>  1  Graham Stark
>  1  John Myles White
>  1  Michael Savastio
>  1  TheCedarPrince
>
> So here is the target list:
>
> 16  Jarrett Revels
> 10  Curtis Vogt
>  8  Eric Hanson
>  2  Douglas Bates
>  2  Kristoffer Carlsson
>
> Jacob, could you ask them to sign a CLA for TODO item 1?
> FYI: Ballista used a GitHub issue for this:
>
> https://github.com/ballista-compute/ballista/issues/646#issuecomment-806820971
>
> And could you also remind the following people of TODO item 2?
>
> 53  Jacob Quinn
> 16  Jarrett Revels
> 10  Curtis Vogt
>  8  Eric Hanson
>  2  Douglas Bates
>  2  ExpandingMan
>  2  Kristoffer Carlsson
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [RESULT][VOTE] Restart the Julia implementation with new repository
> and process" on Mon, 29 Nov 2021 20:43:32 -0700,
>   Jacob Quinn  wrote:
>
> > Thanks kou,
> >
> > I'm happy to help in any way I can. I think I know all the Arrow.jl
> > contributors personally, so I'm happy to reach out to them for whatever
> > is needed.
> >
> > -Jacob
> >
> > On Wed, Nov 24, 2021 at 1:04 AM Sutou Kouhei  wrote:
> >
> >> Hi,
> >>
> >> Sorry for not working on this. I asked about this on
> >> gene...@incubator.apache.org [1] and got a reply that there
> >> is a project that uses GitHub's transfer repository
> >> feature [2].
> >>
> >> Let's start the IP clearance process against
> >> https://github.com/JuliaData/Arrow.jl and use GitHub's
> >> transfer repository feature after the IP clearance has
> >> passed.
> >>
> >> Is there anyone who can help with this process? I think that we
> >> need to fill in the IP clearance form based on
> >> https://incubator.apache.org/ip-clearance/ip-clearance-template.html
> >> .
> >>
> >>
> >> [1] https://lists.apache.org/thread/6nqbzkp4owt43l66283d55302mjrjkzf
> >> [2] https://lists.apache.org/thread/1

Re: [RESULT][VOTE] Restart the Julia implementation with new repository and process

2021-11-29 Thread Jacob Quinn
Thanks kou,

I'm happy to help in any way I can. I think I know all the Arrow.jl
contributors personally, so I'm happy to reach out to them for whatever is
needed.

-Jacob

On Wed, Nov 24, 2021 at 1:04 AM Sutou Kouhei  wrote:

> Hi,
>
> Sorry for not working on this. I asked about this on
> gene...@incubator.apache.org [1] and got a reply that there
> is a project that uses GitHub's transfer repository
> feature [2].
>
> Let's start the IP clearance process against
> https://github.com/JuliaData/Arrow.jl and use GitHub's
> transfer repository feature after the IP clearance has
> passed.
>
> Is there anyone who can help with this process? I think that we
> need to fill in the IP clearance form based on
> https://incubator.apache.org/ip-clearance/ip-clearance-template.html
> .
>
>
> [1] https://lists.apache.org/thread/6nqbzkp4owt43l66283d55302mjrjkzf
> [2] https://lists.apache.org/thread/15fx1j0zdnwmxxr0zo1mjf34gjwkxxly
>
>
> Thanks,
> --
> kou
>
> In <20211013.053514.2154036949522056512@clear-code.com>
>   "Re: [RESULT][VOTE] Restart the Julia implementation with new repository
> and process" on Wed, 13 Oct 2021 05:35:14 +0900 (JST),
>   Sutou Kouhei  wrote:
>
> > Hi Jacob,
> >
> > It's a good idea if we can do this.
> >
> > Does anyone know where we can ask this?
> >
> > gene...@incubator.apache.org ?
> > https://lists.apache.org/list.html?gene...@incubator.apache.org
> >
> > INFRA JIRA?
> > https://issues.apache.org/jira/projects/INFRA
> >
> > Or ...?
> >
> >
> > It seems that we can use an existing GitHub repository's
> > codebase for IP clearance. Some items listed in
> > https://incubator.apache.org/ip-clearance/ do so, such as
> > https://incubator.apache.org/ip-clearance/daffodil-vscode-debugger
> > .
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [RESULT][VOTE] Restart the Julia implementation with new
> repository and process" on Tue, 12 Oct 2021 09:24:36 -0600,
> >   Jacob Quinn  wrote:
> >
> >> Hi kou,
> >>
> >> I'm looking into the next steps and wondering if it's possible to use
> >> the GitHub mechanism of "transferring a repository" (
> >> https://docs.github.com/en/repositories/creating-and-managing-repositories/transferring-a-repository
> >> ), since that could simplify a lot of things. It would retain the
> >> existing GitHub Actions CI and other repository settings, auto-generate
> >> a redirect from the existing JuliaData/Arrow.jl repo, and completely
> >> preserve the commit/git history already in place.
> >>
> >> Do we know if this is a possibility? I realize we'd need to do the IP
> >> clearance before transferring, which is fine; just wondering if we can
> >> leverage this functionality from GitHub?
> >>
> >> On Sat, Oct 2, 2021 at 11:19 PM Sutou Kouhei  wrote:
> >>
> >>> Hi Jacob,
> >>>
> >>> Could you open a pull request to import
> >>> https://github.com/JuliaData/Arrow.jl on
> >>> https://github.com/apache/arrow-julia like
> >>> https://github.com/apache/arrow/pull/8547 ?
> >>>
> >>>
> >>> Thanks,
> >>> --
> >>> kou
> >>>
> >>> In <20211003.140948.2107475918212883624@clear-code.com>
> >>>   "Re: [RESULT][VOTE] Restart the Julia implementation with new
> repository
> >>> and process" on Sun, 03 Oct 2021 14:09:48 +0900 (JST),
> >>>   Sutou Kouhei  wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> >>   * GitHub notification list: comm...@arrow.apache.org
> >>> >
> >>> > I should have used git...@arrow.apache.org for this. I've
> >>> > fixed this by pushing .asf.yaml to apache/arrow-julia:
> >>> >
> >>> >   https://github.com/apache/arrow-julia/blob/main/.asf.yaml
> >>> >
> >>> > I needed to use
> >>> > https://gitbox.apache.org/repos/asf/arrow-julia.git for the
> >>> > first push. I couldn't use
> >>> > g...@github.com:apache/arrow-julia.git .
> >>> >
> >>> >
> >>> > Thanks,
> >>> > --
> >>> > kou
> >>> >
> >>> > In <20211003.134022.661649063345488310@clear-code.com>
> >>> >   "Re: [RESULT][VOTE] Restart the Julia implementation with new
> >>> repo

Re: [RESULT][VOTE] Restart the Julia implementation with new repository and process

2021-10-12 Thread Jacob Quinn
Hi kou,

I'm looking into the next steps and wondering if it's possible to use the
GitHub mechanism of "transferring a repository" (
https://docs.github.com/en/repositories/creating-and-managing-repositories/transferring-a-repository),
since that could simplify a lot of things. It would retain the existing GitHub
Actions CI and other repository settings, auto-generate a redirect from the
existing JuliaData/Arrow.jl repo, and completely preserve the commit/git
history already in place.

Do we know if this is a possibility? I realize we'd need to do the IP
clearance before transferring, which is fine; just wondering if we can
leverage this functionality from GitHub?

On Sat, Oct 2, 2021 at 11:19 PM Sutou Kouhei  wrote:

> Hi Jacob,
>
> Could you open a pull request to import
> https://github.com/JuliaData/Arrow.jl on
> https://github.com/apache/arrow-julia like
> https://github.com/apache/arrow/pull/8547 ?
>
>
> Thanks,
> --
> kou
>
> In <20211003.140948.2107475918212883624@clear-code.com>
>   "Re: [RESULT][VOTE] Restart the Julia implementation with new repository
> and process" on Sun, 03 Oct 2021 14:09:48 +0900 (JST),
>   Sutou Kouhei  wrote:
>
> > Hi,
> >
> >>   * GitHub notification list: comm...@arrow.apache.org
> >
> > I should have used git...@arrow.apache.org for this. I've
> > fixed this by pushing .asf.yaml to apache/arrow-julia:
> >
> >   https://github.com/apache/arrow-julia/blob/main/.asf.yaml
> >
> > I needed to use
> > https://gitbox.apache.org/repos/asf/arrow-julia.git for the
> > first push. I couldn't use
> > g...@github.com:apache/arrow-julia.git .
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In <20211003.134022.661649063345488310@clear-code.com>
> >   "Re: [RESULT][VOTE] Restart the Julia implementation with new
> repository and process" on Sun, 03 Oct 2021 13:40:22 +0900 (JST),
> >   Sutou Kouhei  wrote:
> >
> >> Hi,
> >>
> >> I've created apache/arrow-julia from
> >> https://gitbox.apache.org/setup/newrepo.html with:
> >>
> >>   * PMC: arrow
> >>   * Repository name: julia
> >>   * Generated name: arrow-julia.git
> >>   * Repository description: Apache Arrow Julia
> >>   * Commit notification list: comm...@arrow.apache.org
> >>   * GitHub notification list: comm...@arrow.apache.org
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <20211003.132505.1038511014845186183@clear-code.com>
> >>   "[RESULT][VOTE] Restart the Julia implementation with new repository
> and process" on Sun, 03 Oct 2021 13:25:05 +0900 (JST),
> >>   Sutou Kouhei  wrote:
> >>
> >>> Hi,
> >>>
> >>> The vote carries with 8 +1 binding votes, 3 +1 non-binding
> >>> votes and no -1 votes.
> >>>
> >>> I'll create apache/arrow-julia and start IP clearance
> >>> process to import JuliaData/Arrow.jl to apache/arrow-julia.
> >>>
> >>>
> >>> Thanks,
> >>> --
> >>> kou
> >>>
> >>> In <20210927.115838.114416636593478@clear-code.com>
> >>>   "[VOTE] Restart the Julia implementation with new repository and
> process" on Mon, 27 Sep 2021 11:58:38 +0900 (JST),
> >>>   Sutou Kouhei  wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> This vote is to determine if the Arrow PMC is in favor of
> >>>> the Julia community moving the Julia implementation of
> >>>> Apache Arrow out of apache/arrow into apache/arrow-julia.
> >>>>
> >>>> The Julia community uses a process like the Rust community
> >>>> uses [1][2].
> >>>>
> >>>> Here is a summary of the process:
> >>>>
> >>>>   1. Use GitHub instead of JIRA for issue management platform
> >>>>
> >>>>  Note: Contributors will be required to write issues for
> >>>>  planned features and bug fixes so that we have
> >>>>  visibility and opportunities for collaboration before a
> >>>>  PR shows up.
> >>>>
> >>>>  (This is for the Apache way.)
> >>>>
> >>>>  [1]
> >>>>
> >>>>   2. Release on demand
> >>>>
> >>>>  Like DataFusion.
> >>>>
> >>>>  Release for apache/arrow doesn't include the Julia
> >>>>  implementation.
> >>>>
> >>>>  The Julia implementation uses separated version
> >>>>  scheme. (apache/arrow uses 6.0.0 as the next version
> >>>>  but the next Julia implementation release doesn't use
> >>>>  6.0.0.)
> >>>>
> >>>>  [2]
> >>>>
> >>>> We'll create apache/arrow-julia and start IP clearance
> >>>> process to import JuliaData/Arrow.jl to apache/arrow after
> >>>> the vote is passed. (We don't use julia/arrow/ in
> >>>> apache/arrow.)
> >>>>
> >>>> See also discussions about this: [3][4]
> >>>>
> >>>>
> >>>> Please vote whether to accept the proposal and allow the
> >>>> Julia community to proceed with the work.
> >>>>
> >>>> The vote will be open for at least 72 hours.
> >>>>
> >>>> [ ] +1 : Accept the proposal
> >>>> [ ] 0 : No opinion
> >>>> [ ] -1 : Reject proposal because...
> >>>>
> >>>>
> >>>> [1]
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit
> >>>> [2]
> https://github.com/apache/arrow-datafusion/blob/master/dev/release/README.md
> >>>> [3]
> 

Re: [VOTE] Restart the Julia implementation with new repository and process

2021-09-26 Thread Jacob Quinn
+1

On Sun, Sep 26, 2021 at 8:59 PM Sutou Kouhei  wrote:

> Hi,
>
> This vote is to determine if the Arrow PMC is in favor of
> the Julia community moving the Julia implementation of
> Apache Arrow out of apache/arrow into apache/arrow-julia.
>
> The Julia community uses a process like the Rust community
> uses [1][2].
>
> Here is a summary of the process:
>
>   1. Use GitHub instead of JIRA for issue management platform
>
>  Note: Contributors will be required to write issues for
>  planned features and bug fixes so that we have
>  visibility and opportunities for collaboration before a
>  PR shows up.
>
>  (This is for the Apache way.)
>
>  [1]
>
>   2. Release on demand
>
>  Like DataFusion.
>
>  Release for apache/arrow doesn't include the Julia
>  implementation.
>
>  The Julia implementation uses separated version
>  scheme. (apache/arrow uses 6.0.0 as the next version
>  but the next Julia implementation release doesn't use
>  6.0.0.)
>
>  [2]
>
> We'll create apache/arrow-julia and start IP clearance
> process to import JuliaData/Arrow.jl to apache/arrow after
> the vote is passed. (We don't use julia/arrow/ in
> apache/arrow.)
>
> See also discussions about this: [3][4]
>
>
> Please vote whether to accept the proposal and allow the
> Julia community to proceed with the work.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 : Accept the proposal
> [ ] 0 : No opinion
> [ ] -1 : Reject proposal because...
>
>
> [1]
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit
> [2]
> https://github.com/apache/arrow-datafusion/blob/master/dev/release/README.md
> [3]
> https://lists.apache.org/x/thread.html/r6d91286686d92837fbe21dd042801a57e3a7b00b5903ea90a754ac7b%40%3Cdev.arrow.apache.org%3E
> [4]
> https://lists.apache.org/x/thread.html/r0df7f44f7e1ed7f6e4352d34047d53076208aa78aad308e30b58f83a%40%3Cdev.arrow.apache.org%3E
>
>
> Thanks,
> --
> kou
>


Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?

2021-09-16 Thread Jacob Quinn
Good question.

In my mind, I was imagining the arrow-julia repo would have a fully
decoupled versioning from the main arrow project. This comes from my
understanding that the julia implementation is its own "project" that
implements the arrow spec/format, and we may need a breaking major release
at different cadences than the main spec version. Indeed, while the arrow
project has gone from 2.0 -> 6.0 since the julia implementation was first
released, we're just now releasing our own 2.0.0 version after a change in
API for how metadata is set/retrieved on table/column objects.

I'll admit that it's not entirely clear to me how to best signal/implement
coordination between the main arrow project versions and the julia version
though. I'm just guessing here, but is that why the main arrow project does
so frequent major version releases? To account for any child
implementations happening to have breaking changes? I think I remember
discussion recently around moving the actual spec/format document out as a
separate repo or at least versioning it separately from all the various
implementations, and that seems like it would be a good idea, though I
guess the format itself has versioning built in. It's certainly
something we can clarify in the Julia package itself; i.e. which version of
the spec a given Julia package version is compatible with. Typically with
other julia package dependencies, just a minor version increment is
required when a new breaking dependency version is upgraded, so I would
think we could follow something similar by treating the arrow format as a
"dependency".

I'll clarify that I don't feel very strongly on these points, so if there's
something I'm missing or gaps in my understanding of how the rest of the
web of projects are coordinating things, I'm all ears.

-Jacob

On Thu, Sep 16, 2021 at 11:24 PM Sutou Kouhei  wrote:

> Hi,
>
> Good point! Jacob, could you confirm this?
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?" on Sat, 11
> Sep 2021 16:57:17 -0700,
>   QP Hou  wrote:
>
> > Just one minor point to confirm and clarify. It looks like Julia arrow
> only
> > wants to do on demand minor and patch releases. Major version release
> still
> > needs to be aligned with the main arrow release schedule, is that
> correct?
> > In other words, breaking changes should be avoided in on demand releases
> > (assuming they are using semantic versioning).
> >
> > From the original julia donation thread, I got the impression that the
> > julia maintainers wanted to have their own versioning scheme. Maybe
> that’s
> > not the case anymore. So I wanted to make sure we set the right
> expectation
> > for Julia maintainers.
> >
> > FWIW, Arrow-rs today aligns the major version with the main arrow
> release,
> > so Andrew spends quite a bit of time maintaining an active release branch
> to
> > backport backwards compatible commits for minor and patch releases.
> > DataFusion and Ballista, on the other hand, have a versioning scheme that’s
> > fully decoupled from the main Arrow version including the major version.
> >
> > On Thu, Sep 9, 2021 at 1:38 PM Sutou Kouhei  wrote:
> >
> >> Hi,
> >>
> >> Thanks for all comments about release schedule.
> >>
> >> Let's use release-on-demand approach based on
> >> arrow-datafusion's flow for the Julia Arrow implementation.
> >>
> >> Do we have more items to be discussed? Can we start voting?
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In 
> >>   "Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?" on Thu, 9
> >> Sep 2021 09:48:57 -0400,
> >>   Andrew Lamb  wrote:
> >>
> >> > I also think release on demand is a good strategy.
> >> >
> >> > The primary reasons to do an arrow-rs release every 2 weeks were:
> >> > 1. To have predictable cadence into downstream projects (e.g.
> datafusion
> >> > and others)
> >> > 2. Amortize the overhead associated with each release (the process is
> non
> >> > trivial and the current 72 hour voting window adds some backpressure
> as
> >> > well -- I remember Wes may have said windows shorter than 72 hours
> might
> >> be
> >> > fine too)
> >> >
> >> >
> >> > On Wed, Sep 8, 2021 at 12:19 AM QP Hou 
> wrote:
> >> >
> >> >> A minor note on the Rust side of things. arrow-rs has a 2 weeks
> >> >> release cycle, but arrow-datafusion mostly does release on demand at
> >>

Re: [DISCUSS][Julia] How to restart at apache/arrow-julia?

2021-09-07 Thread Jacob Quinn
Thanks kou.

I think the TODO action list looks good.

The one point I think could use some additional discussion is around the
release cadence: it IS desirable to be able to release more frequently than
the parent repo 3-4 month cadence. But we also haven't had the frequency of
commits to necessarily warrant a release every 2 weeks. I can think of two
possible options, not sure if one or the other would be more compatible
with the apache release process:

1) Allow for release-on-demand; this is idiomatic for most Julia packages
I'm aware of. When a particular bug is fixed, or feature added, a user can
request a release, a little discussion happens, and a new release is made.
This approach would work well for the "bursty" kind of contributions we've
seen to Arrow.jl where development by certain people will happen frequently
for a while, then take a break for other things. This also avoids having
"scheduled" releases (every 2 weeks, 3 months, etc.) where there haven't
been significant updates to necessarily warrant a new release. This
approach may also facilitate differentiating between bugfix (patch)
releases vs. new functionality releases (minor), since when a release is
requested, it could be specified whether it should be patch or minor (or
major).

2) Commit to a scheduled release pattern like every 2 weeks, once a month,
etc. This has the advantage of consistency and clearer expectations for
users/devs involved. A release also doesn't need to be requested, because
we can just wait for the scheduled time to release. In terms of the
"unnecessary releases" mentioned above, it could be as simple as
"cancelling" a release if there haven't been significant updates in the
elapsed time period.
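
The patch/minor/major distinction mentioned under option 1 can be sketched
as follows. This is only an illustration of semantic-versioning bumps; the
function name `bump` and the version strings are assumptions made for the
example and are not part of Arrow.jl or any Apache release tooling.

```python
def bump(version: str, kind: str) -> str:
    """Return the next semantic version for a 'major', 'minor', or 'patch' release.

    Hypothetical helper: it only handles plain "X.Y.Z" strings.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if kind == "major":
        # Breaking change: reset minor and patch.
        return f"{major + 1}.0.0"
    if kind == "minor":
        # New functionality, backwards compatible: reset patch.
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        # Bug fix only.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown release kind: {kind!r}")
```

With release-on-demand, the person requesting the release would state the
kind, and the next version number follows mechanically from it.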

My preference would be for 1), but that's influenced by what I'm familiar
with in the Julia package ecosystem. It seems like it would still fit in
the apache way since we would formally request a new release, wait the
elapsed amount of time for voting (24 hours would be preferable), then at
the end of the voting period, a new release could be made.

Thanks again kou for helping support the Julia implementation here.

-Jacob

On Sun, Sep 5, 2021 at 3:25 PM Sutou Kouhei  wrote:

> Hi,
>
> Sorry for the delay. This is a continuation of the "Status
> of Arrow Julia implementation?" thread:
>
>
> https://lists.apache.org/x/thread.html/r6d91286686d92837fbe21dd042801a57e3a7b00b5903ea90a754ac7b%40%3Cdev.arrow.apache.org%3E
>
> I summarize the current status, the next actions and items
> to be discussed.
>
> The current status:
>
>   * The Julia Arrow implementation uses
> https://github.com/JuliaData/Arrow.jl as a "dev branch"
> instead of creating a branch in
> https://github.com/apache/arrow
>   * The Julia Arrow implementation wants to use GitHub
> for the main issue management platform
>   * The Julia Arrow implementation wants to release
> more frequently than 1 release per 3-4 months
>   * The current workflow of the Rust Arrow implementation
> will also fit the Julia Arrow implementation
>
> The current workflow of the Rust Arrow implementation:
>
>
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit#heading=h.kv1hwbhi3cmi
>
> * Uses apache/arrow-rs and apache/arrow-datafusion instead
>   of apache/arrow for repository
>
> * Uses GitHub instead of JIRA for issue management
>   platform
>
>
> https://docs.google.com/document/d/1tMQ67iu8XyGGZuj--h9WQYB9inCk6c2sL_4xMTwENGc/edit
>
> * Releases a new minor and patch version every 2 weeks
>   in addition to the quarterly release of the other releases
>
> The next actions after we get a consensus about this
> discussion:
>
>   1. Start voting the Julia Arrow implementation move like
>  the Rust's one:
>
>
> https://lists.apache.org/x/thread.html/r44390a18b3fbb08ddb68aa4d12f37245d948984fae11a41494e5fc1d@%3Cdev.arrow.apache.org%3E
>
>   2. Create apache/arrow-julia
>
>   3. Start IP clearance process to import JuliaData/Arrow.jl
>  to apache/arrow-julia
>
>  (We don't use julia/Arrow/ in apache/arrow.)
>
>   4. Import JuliaData/Arrow.jl to apache/arrow-julia
>
>   5. Prepare integration tests CI in apache/arrow-julia and apache/arrow
>
>   6. Prepare releasing tools in apache/arrow-julia and apache/arrow
>
>   7. Remove julia/... from apache/arrow and leave
>  julia/README.md pointing to apache/arrow-julia
>
>
> Items to be discussed:
>
>   * Interval of minor and patch releases
>
> * The Rust Arrow implementation uses 2 weeks.
>
> * Does the Julia Arrow implementation also want to use
>   2 weeks?
>
>   * Can we comply with the Apache way with this workflow
> without pain?
>
> The Rust Arrow implementation workflow includes the
> following for this:
>
>
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit#heading=h.kv1hwbhi3cmi
>
>   > Contributors will be required to write issues for
>   > planned 

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Jacob Quinn
>
> I just thought of one other requirement: the format needs to support
> arbitrary byte sequences.
>
Can you clarify why this is needed? Is it that custom_metadata maps should
allow byte sequences as values?

On Fri, Aug 13, 2021 at 10:00 AM Phillip Cloud  wrote:

> On Fri, Aug 13, 2021 at 11:43 AM Antoine Pitrou 
> wrote:
>
> >
> > Le 13/08/2021 à 17:35, Phillip Cloud a écrit :
> > >
> > >> I.e. make the ability to read and write by humans be more important
> than
> > >> speed of validation.
> > >
> > > I think I differ on whether the IR should be easy to read and write by
> > > humans.
> > > IR is going to be predominantly read and written by machines, though of
> > > course
> > > we will need a way to inspect it for debugging.
> >
> > But the code executed by machines is written by humans.  I think that's
> > mostly where the contention resides: is it easy to code, in any given
> > language, the routines required to produce or consume the IR?
> >
>
> Definitely not for flatbuffers, since flatbuffers is IMO annoying to use in
> any language except C++,
> and it's borderline annoying there too. Protobuf is similar (less annoying
> in Rust,
> but still annoying in Python and C++ IMO), though I think any binary format
> is going to be
> less human-friendly, by construction.
>
> If we were to use something like JSON or msgpack, can someone sketch out
> the interaction
> between the IR and the rest of arrow's type system?
>
> Would we need a JSON-encoded-arrow-type -> in-memory representation for an
> Arrow type in a given language?
>
> I just thought of one other requirement: the format needs to support
> arbitrary byte sequences. JSON
> doesn't support untransformed byte sequences, though it's not uncommon to
> base64-encode a byte sequence.
> IMO that adds an unnecessary layer of complexity, which is another tradeoff
> to consider.
>
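
The base64 workaround mentioned in the quoted message can be sketched as
follows. JSON has no native bytes type, so raw byte sequences are commonly
carried as base64 text. The record layout and the function names below are
assumptions for illustration only; they are not part of any Arrow IR
proposal.

```python
import base64
import json


def encode_with_bytes(name, payload):
    """Serialize a record to JSON text, base64-encoding the raw bytes field."""
    encoded = base64.b64encode(payload).decode("ascii")
    return json.dumps({"name": name, "payload": encoded})


def decode_with_bytes(text):
    """Parse the JSON text and recover the original bytes from the base64 field."""
    record = json.loads(text)
    return record["name"], base64.b64decode(record["payload"])
```

The extra encode/decode step on every bytes field is the added layer of
complexity the message refers to; a binary format would store the bytes
directly.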


Re: Status of Arrow Julia implementation?

2021-06-25 Thread Jacob Quinn
Hi Kou,

Sorry for the slow response here, but it's been great to see how the new
Rust process has shaken out and I think it is working well. I'd like to move
forward with transferring the JuliaData/Arrow.jl repository to
apache/arrow-julia and following a similar process to Rust in terms of
development/release. I can start on a Julia-specific proposal with
specifics.

Thanks for all the help!

-Jacob

On Sun, Apr 25, 2021 at 11:56 PM Sutou Kouhei  wrote:

> Hi,
>
> I think that we can say the Rust migration is complete once
> we merge https://github.com/apache/arrow/pull/10096. But
> it's a good time to think about the Julia migration.
>
> Jacob, here is the Rust's new development process:
>
>
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit#
>
> (It seems that an anonymous user deleted a part of it
> accidentally.)
>
> Do you want to use the same development process as the
> Rust's one? Do you have any item you want to change?
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: Status of Arrow Julia implementation?" on Sun, 25 Apr 2021 13:34:04
> -0700,
>   Micah Kornfield  wrote:
>
> > It seems the Rust migration is now complete.  Do we want to wait to iron
> > out the other potential issues?
> >
> > I think the outstanding ones might be:
> > 1.  Issue management
> > 2.  Integration testing
> >
> > -Micah
> >
> > On Wed, Apr 14, 2021 at 11:01 PM Sutou Kouhei 
> wrote:
> >
> >> Hi Jacob,
> >>
> >> OK. Here is my plan:
> >>
> >>   1. We wait for the Rust's move to complete
> >>   2. We use a process similar to the Rust's move
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In 
> >>   "Re: Status of Arrow Julia implementation?" on Wed, 14 Apr 2021
> 08:37:41
> >> -0600,
> >>   Jacob Quinn  wrote:
> >>
> >> > Thank you kou! I appreciate the help. I'm happy to do whatever is
> >> required
> >> > to facilitate the moving/donating process from JuliaData/Arrow.jl to
> >> > apache/arrow-julia.
> >> >
> >> > -Jacob
> >> >
> >> > On Mon, Apr 12, 2021 at 7:53 PM Sutou Kouhei 
> wrote:
> >> >
> >> >> Hi Jacob,
> >> >>
> >> >> I, a PMC member, talked to Kenta Murata, a committer and a
> >> >> Julia user, about this.
> >> >>
> >> >> We support that you and Julia folks work on
> >> >> apache/arrow-julia until we have enough PMC members from
> >> >> Julia folks. For example, we'll help with the IP clearance process to
> >> >> import the latest JuliaData/Arrow.jl changes to apache/ and
> >> >> we'll start voting on Julia package release.
> >> >>
> >> >>
> >> >> Thanks,
> >> >> --
> >> >> kou
> >> >>
> >> >> In  d-heskgn2mm57...@mail.gmail.com>
> >> >>   "Re: Status of Arrow Julia implementation?" on Sun, 11 Apr 2021
> >> 23:06:27
> >> >> -0600,
> >> >>   Jacob Quinn  wrote:
> >> >>
> >> >> > Micah/Wes,
> >> >> >
> >> >> > Yes, I've been following the rust proposal thread with great
> >> interest. I
> >> >> do
> >> >> > think that provides a great path forward: transferring the
> >> >> > JuliaData/Arrow.jl repo to apache/arrow-julia would help to solve
> the
> >> >> > "package history" technical challenges that in part led to the
> current
> >> >> > setup and concerns. I think being able to utilize github issues
> would
> >> >> also
> >> >> > be great; as I've mentioned elsewhere, it's much more
> >> >> traditional/expected
> >> >> > in the Julia ecosystem.
> >> >> >
> >> >> > I think the package could retain an independent versioning scheme.
> The
> >> >> >> additional process would be voting on release candidates. If the
> >> Julia
> >> >> >> folks want to try again and move development to a new,
> Julia-specific
> >> >> >> apache/* repository and apply the ASF governance to the project,
> the
> >> >> >> Arrow PMC could probably fast-track making Jacob a committer. In
> some
> >> >> >> code donations / IP clearance, the contributors for the donated
&

Re: Status of Arrow Julia implementation?

2021-04-14 Thread Jacob Quinn
Thank you kou! I appreciate the help. I'm happy to do whatever is required
to facilitate the moving/donating process from JuliaData/Arrow.jl to
apache/arrow-julia.

-Jacob

On Mon, Apr 12, 2021 at 7:53 PM Sutou Kouhei  wrote:

> Hi Jacob,
>
> I, a PMC member, talked to Kenta Murata, a committer and a
> Julia user, about this.
>
> We support that you and Julia folks work on
> apache/arrow-julia until we have enough PMC members from
> Julia folks. For example, we'll help with the IP clearance process to
> import the latest JuliaData/Arrow.jl changes to apache/ and
> we'll start voting on Julia package release.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: Status of Arrow Julia implementation?" on Sun, 11 Apr 2021 23:06:27
> -0600,
>   Jacob Quinn  wrote:
>
> > Micah/Wes,
> >
> > Yes, I've been following the rust proposal thread with great interest. I
> do
> > think that provides a great path forward: transferring the
> > JuliaData/Arrow.jl repo to apache/arrow-julia would help to solve the
> > "package history" technical challenges that in part led to the current
> > setup and concerns. I think being able to utilize github issues would
> also
> > be great; as I've mentioned elsewhere, it's much more
> traditional/expected
> > in the Julia ecosystem.
> >
> > I think the package could retain an independent versioning scheme. The
> >> additional process would be voting on release candidates. If the Julia
> >> folks want to try again and move development to a new, Julia-specific
> >> apache/* repository and apply the ASF governance to the project, the
> >> Arrow PMC could probably fast-track making Jacob a committer. In some
> >> code donations / IP clearance, the contributors for the donated code
> >> become committers as part of the transaction.
> >>
> >
> > These all sound great and would greatly facilitate a better integration
> > under ASF governance. These points definitely resolve my main concerns.
> >
> > As I commented on the rust thread, I'm mostly interested in the future of
> > integration testing for rust/julia if they are split out into separate
> > repos. In the current Julia implementation, we have all the code to read
> > arrow json, and I just hand-generated the integration test data and
> > committed them in the repo itself, but it doesn't interface with other
> > languages (just reads arrow json, produces arrow file, reads arrow file,
> > compares w/ original arrow json). I'm happy to help work on the details
> of
> > what that looks like and pilot some solutions. I think with a solid
> > inter-repo integration testing framework, we can keep a strong sync
> between
> > projects.
> >
> > -Jacob
> >
> >
> > On Sun, Apr 11, 2021 at 5:08 PM Wes McKinney 
> wrote:
> >
> >> On Sat, Apr 10, 2021 at 4:07 PM Micah Kornfield 
> >> wrote:
> >> >
> >> > >
> >> > > Ok, I've had a chance to discuss with a few other Julia developers
> and
> >> > > review various options. I think it's best to drop the Julia code
> from
> >> the
> >> > > physical apache/arrow repo. The extra overhead on development,
> release
> >> > > process, and user issue reporting and PR contributing are too much
> in
> >> > > addition to the technical challenges that we never resolved
> involving
> >> > > including the past Arrow.jl release version git trees in the
> >> apache/arrow
> >> > > repo.
> >> >
> >> >
> >> > Hi Jacob,
> >> > It seems you are on the new thread discussing a proposal for changing
> >> > Rust's development model.   Would the proposal [1] address most of
> these
> >> > concerns if Julia was set up in the same way?
> >> >
> >> >  It seems in the short term the stickiest point would be committer
> access
> >> > to the new repos, and I suppose the release mechanics still might be
> >> > challenging?
> >>
> >> I think the package could retain an independent versioning scheme. The
> >> additional process would be voting on release candidates. If the Julia
> >> folks want to try again and move development to a new, Julia-specific
> >> apache/* repository and apply the ASF governance to the project, the
> >> Arrow PMC could probably fast-track making Jacob a committer. In some
> >> code donations / IP clearance, the contributors for the donated code
> >> become committers as part of the transaction.
>

Re: Status of Arrow Julia implementation?

2021-04-11 Thread Jacob Quinn
he timing and
> > > frequency of releases for the Julia codebase are in my mind easy to
> > > resolve, and if you had indicated that having a customized process for
> > > Julia releases was a condition for your joining the community
> > > wholeheartedly, we would have been happy to help. I think that the
> > > benefits of common CI/CD infrastructure and opportunities to build
> > > deeper integrations between the Julia implementation and the other
> > > implementations (imagine... Julia kernels running in DataFusion?)
> > > would outweigh the sense of "loss of control" from developing within a
> > > larger project.
> > >
> > > On Wed, Apr 7, 2021 at 12:16 AM Jacob Quinn 
> > > wrote:
> > > >
> > > > Responses inline below:
> > > >
> > > > On Tue, Apr 6, 2021 at 9:46 PM Jorge Cardoso Leitão <
> > > > jorgecarlei...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > you all did not attempt to work in the community for any
> meaningful
> > > > > amount of time and
> > > > > are choosing not to try based on the perception that it will create
> > > > > unacceptable overhead for you
> > > > >
> > > > > It is not self-evident to me that Julia's community was
> sufficiently
> > > > > informed about what they
> > > > > had to give in in terms of process and release management when
> merging
> > > /
> > > > > donating.
> > > > >
> > > >
> > > > Yes, it was pretty unclear what the process was if we needed to do
> any
> > > kind
> > > > of patch release. I know that has been sorted out better recently,
> but
> > > back
> > > > in November, it didn't really seem like an option (i.e. independent
> > > > language patch releases).
> > > >
> > > >
> > > > > IMO this is a plausible explanation as to why the donation was
> made and
> > > > > then later abandoned.
> > > > >
> > > > >
> > > > I'll just note that the "abandonment" can only be a perception from
> the
> > > > apache/arrow side of things, but as I mentioned above, I also tried
> to
> > > > clearly state in the julia/Arrow/README that the development process
> > > would
> > > > continue with the JuliaData/Arrow.jl repo as the main "dev" branch,
> with
> > > > changes being upstreamed to the apache/arrow repo, which was followed
> > > > through, having an upstream of commits right before the 3.0.0
> release,
> > > and
> > > > I was planning on doing the same soon for the 4.0.0 release. That is
> to
> > > > say, the Julia implementation has continued progressing forward quite
> > > > rapidly, IMO, but I can see that perhaps apache/arrow repo members
> may
> > > have
> > > > viewed it as "abandoned".
> > > >
> > > >
> > > > > I do not fully understand why the pain points Jacob mentioned were
> not
> > > > > brought up to the mailing list sooner, though.
> > > > >
> > > >
> > > > To be honest and frank, I didn't have pain points with the
> development
> > > > process I outlined when the code was donated and as stated in the
> README.
> > > > That was the process that made the donation possible and I imagined
> would
> > > > work well going forward, and has, until this thread started and it
> was
> > > > pointed out that this process isn't viable. The pain points were
> > > discussed
> > > > with the initial code donation, but in my mind were resolved with the
> > > > development process that was decided upon.
> > > >
> > > >
> > > > > This made us unable to potentially take corrective measures. I
> think
> > > that
> > > > > this is why everyone was taken a bit by surprise with this.
> > > > >
> > > > > Best,
> > > > > Jorge
> > > > >
> > > > >
> > > > > On Fri, Apr 2, 2021 at 10:18 PM Wes McKinney 
> > > wrote:
> > > > >
> > > > > > hi Jacob — sorry to hear that. It's a bummer that you all did not
> > > > > > attempt to work in the community for any meaningful amount of
> time
> > > and
> >

Re: [DISCUSS] [Rust] Move Rust components to new repos and process

2021-04-10 Thread Jacob Quinn
Jorge,

* in rust, run integration tests against the latest apache/master on every
> PR
>

I've started to familiarize myself with the archery integration framework
over the last few days. Could you clarify for the "archery novices" what
exactly ^ this line would mean? Does apache/master refer to the C++
implementation as the "reference implementation", so rust would test
against/integrate with it? Or is it the arrow JSON format that needs to be
consumed into valid arrow in-memory, then produce the same arrow JSON from
in-memory arrow (this seems to be the extent of the go integration tests at
least)?
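
To make the second reading concrete (JSON in, in-memory data, same JSON
out), here is a minimal sketch of that round-trip property. The JSON layout
and the `toy_read`/`toy_write` functions are stand-ins invented for this
example; they are not the Arrow integration JSON schema and not archery's
API.

```python
import json


def round_trips(json_text, read, write):
    """True if reading the JSON into memory and writing it back gives an equal document."""
    original = json.loads(json_text)
    return write(read(original)) == original


# Toy stand-ins: the "in-memory" form is a dict of column name -> list of values.
def toy_read(doc):
    return {col["name"]: list(col["data"]) for col in doc["columns"]}


def toy_write(table):
    return {"columns": [{"name": name, "data": data} for name, data in table.items()]}
```

A producer/consumer test across two implementations would instead have one
implementation write a file and a different one read it, which is the other
reading of the question.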

Sorry if this is easily answerable from knowing archery better, but I'm still
in the learning/discovery phase of how exactly all the integration tests
are set up/run.

-Jacob


On Sat, Apr 10, 2021 at 1:03 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> With respect to integration tests, I agree that it is important to have a plan prior
> to this.
>
> What we have been doing in the apache/arrow:
>
> 1. only release if integration tests pass against each other
> 2. release the signed tar with the latest of every implementation (i.e.
> master)
>
> My suggestion for independent versioning:
>
> CI:
>
> * in rust, run integration tests against the latest apache/master on every
> PR
> * in apache/arrow, run integration tests against the latest released rust
> version
>
> Release mechanism:
>
> 1. an arrow crate can only be released if it passes integration tests
> against the current latest apache/arrow master
> 2. apache/arrow master can release if their integration tests pass against
> the latest released rust crate
>
> The common scenario is that the integration tests in apache/arrow against
> Rust pass, and thus
> apache/arrow would just need to bundle the latest rust release.
>
> If tests in apache/arrow fail, then some change in apache/arrow
> caused our latest release to stop integrating (since we integration-tested
> that version against master prior to our release).
> This implies that a current Rust release is out of spec and we thus must
> release a patch
> asap to correct for this (just like we would need to push a commit to
> apache/arrow asap).
> Once that patch is released, apache/arrow becomes green again and
> apache/arrow can bundle these on the signed apache arrow release.
>
> In the unlikely event that the latest release is unable to pass integration
> tests *and* despite the best efforts Rust is unable to release a patch in
> time, we *may* still bundle a previous release of the Rust crate, thereby
> not blocking the whole
> release (i.e. this allows us to fall back to a previous release without a
> mass revert on the apache/arrow repo).
>
> > * If Rust runs against the latest nightly of Arrow then how will Rust
> > release without a new Arrow release?
>
> Not sure if this answers, but Rust does not compile or link against any
> implementation, so there are
> no ABI contracts. Its "only" contract is the spec (in-memory, IPC, flight,
> C data interface, etc).
>
> A related point is that when we release a Rust version, we can upload
> "integration test artifacts" separately (the same binaries that we
> currently use in our integration
> tests or a docker image with them), that apache/arrow can use to run
> integration tests.
> This would allow our CI at apache/arrow to download these artifacts and run
> tests as usual via archery and CLI,
> without having to compile them. This would alleviate some of the challenges
> around integration testing whereby every implementation is currently built
> on every run and in sequence.
>
> If someone thinks that it is useful, I would be happy to open a JIRA on
> this and draft a google docs
> to work out a technical design.
>
> Best,
> Jorge
>
>
> On Sat, Apr 10, 2021 at 1:57 AM Weston Pace  wrote:
>
> > > I'm assuming the idea is that the existing integration tests will
> remain
> > in apache/arrow. Will you also run the integration test suites on your
> rust
> > repository CI checks?
> >
> > Furthermore, against what version will these tests run?
> >
> > * If Arrow runs against the latest release of Rust then it will lag
> > behind and issues may be detected later.
> > * If Arrow runs against the latest nightly of Rust then things will
> > get tricky at release time (all Arrow integrations tests pass but Rust
> > isn't ready to cut a new release and Arrow tests fail against the
> > latest released Rust).
> >
> > Assuming Rust is also running integration tests against Arrow
> > (probably a good idea) you get a similar problem (this one might be
> > trickier given the relative frequencies)...
> >
> > * If Rust runs against the latest release of Arrow then it will lag
> > behind (several months).  There will be a "catching up" period after
> > Arrow releases.
> > * If Rust runs against the latest nightly of Arrow then how will Rust
> > release without a new Arrow release?
> >
> > Note, these problems technically exist now with the concept that any
> > 

Alignment not stored in arrow metadata

2021-04-06 Thread Jacob Quinn
As far as I can tell, the alignment padding used in an IPC stream/file
isn't stored explicitly, and not really "inferrable", though maybe
technically possible if you calculated what bytes are *necessary* given a
buffer's data vs. what's actually stored.

Just wondering if this has been brought up at all to store explicitly; it
came up in the Julia implementation when considering "appending" record
batches to an IPC stream that has already been written to disk; we
originally thought we would need to match alignment used in previously
written record batches, but upon further reflection, it seems like
technically it wouldn't matter since all buffers have the exact byte counts
written anyway. Just wasn't sure if it would be breaking implicit
assumptions by consumers somewhere if they happened to get an IPC stream w/
record batches that mixed, for example, 8-byte and 64-byte alignments.
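
To make the "exact byte counts" point concrete, here is a minimal Python
sketch. The helper names (padded_len, write_buffers, read_buffers) are made
up for illustration and are not part of any Arrow library; the only
assumption is that each buffer's offset and length are recorded alongside the
body, so a reader never has to know which alignment the writer used:

```python
def padded_len(nbytes, alignment):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


def write_buffers(buffers, alignment):
    """Concatenate buffers with zero padding.

    Returns the body bytes plus a list of (offset, length) pairs, which
    stands in for the per-buffer metadata written with a record batch.
    """
    body = bytearray()
    layout = []
    for buf in buffers:
        layout.append((len(body), len(buf)))
        body += buf
        body += b"\x00" * (padded_len(len(buf), alignment) - len(buf))
    return bytes(body), layout


def read_buffers(body, layout):
    """Recover the buffers using only the recorded offsets and lengths."""
    return [body[offset:offset + length] for offset, length in layout]


buffers = [b"abc", b"0123456789"]
body8, layout8 = write_buffers(buffers, 8)     # 8-byte alignment
body64, layout64 = write_buffers(buffers, 64)  # 64-byte alignment
```

Under that assumption both bodies decode to the same buffers even though
their sizes differ, which is why mixing alignments between record batches
should not confuse a reader that trusts the recorded offsets and lengths.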

-Jacob


Re: Status of Arrow Julia implementation?

2021-04-06 Thread Jacob Quinn
Responses inline below:

On Tue, Apr 6, 2021 at 9:46 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> > you all did not attempt to work in the community for any meaningful
> amount of time and
> are choosing not to try based on the perception that it will create
> unacceptable overhead for you
>
> It is not self-evident to me that Julia's community was sufficiently
> informed about what they
> had to give in in terms of process and release management when merging /
> donating.
>

Yes, it was pretty unclear what the process was if we needed to do any kind
of patch release. I know that has been sorted out better recently, but back
in November, it didn't really seem like an option (i.e. independent
language patch releases).


> IMO this is a plausible explanation as to why the donation was made and
> then later abandoned.
>
>
I'll just note that the "abandonment" can only be a perception from the
apache/arrow side of things, but as I mentioned above, I also tried to
clearly state in the julia/Arrow/README that the development process would
continue with the JuliaData/Arrow.jl repo as the main "dev" branch, with
changes being upstreamed to the apache/arrow repo, which was followed
through, having an upstream of commits right before the 3.0.0 release, and
I was planning on doing the same soon for the 4.0.0 release. That is to
say, the Julia implementation has continued progressing forward quite
rapidly, IMO, but I can see that perhaps apache/arrow repo members may have
viewed it as "abandoned".


> I do not fully understand why the pain points Jacob mentioned were not
> brought up to the mailing list sooner, though.
>

To be honest and frank, I didn't have pain points with the development
process I outlined when the code was donated and as stated in the README.
That was the process that made the donation possible and I imagined would
work well going forward, and has, until this thread started and it was
pointed out that this process isn't viable. The pain points were discussed
with the initial code donation, but in my mind were resolved with the
development process that was decided upon.


> This made us unable to potentially take corrective measures. I think that
> this is why everyone was taken a bit by surprise with this.
>
> Best,
> Jorge
>
>
> On Fri, Apr 2, 2021 at 10:18 PM Wes McKinney  wrote:
>
> > hi Jacob — sorry to hear that. It's a bummer that you all did not
> > attempt to work in the community for any meaningful amount of time and
> > are choosing not to try based on the perception that it will create
> > unacceptable overhead for you. I believe the benefits would outweigh
> > the costs, but I suppose we will have to agree to disagree.
> >
> > Can you prepare a pull request to do the requisite repository surgery?
> > I hope the development goes well in the future and look forward to
> > seeing folks from the Julia ecosystem engaged here on growing the
> > Arrow ecosystem.
> >
> > Thanks,
> > Wes
> >
> > On Fri, Apr 2, 2021 at 3:03 PM Jacob Quinn 
> wrote:
> > >
> > > Ok, I've had a chance to discuss with a few other Julia developers and
> > > review various options. I think it's best to drop the Julia code from
> the
> > > physical apache/arrow repo. The extra overhead on development, release
> > > process, and user issue reporting and PR contributing are too much in
> > > addition to the technical challenges that we never resolved involving
> > > including the past Arrow.jl release version git trees in the
> apache/arrow
> > > repo.
> > >
> > > We're still very much committed to working on the Julia implementation
> > and
> > > participating in the broader arrow community. I've enjoyed following
> the
> > > user/dev mailing lists and will continue to do so. We monitor format
> > > proposals and try to implement new functionality as quickly as
> possible.
> > We
> > > got the initial arrow flight proto code generated just last night in
> > fact.
> > > I'd still like to explore official integration with the archery test
> > suite
> > > to solidify the Julia implementation with integration tests; I think
> that
> > > would be very valuable for long-term confidence in the cross-language
> > > support of the Julia implementation.
> > >
> > > We realize one of the main implications will probably be dropping Julia
> > > from the list of "official implementations". We're encouraged by the
> many
> > > users who have already started using the Julia implementation and will
> > > strive to maintain a high rate of issue 

Re: Status of Arrow Julia implementation?

2021-04-02 Thread Jacob Quinn
Ok, I've had a chance to discuss with a few other Julia developers and
review various options. I think it's best to drop the Julia code from the
physical apache/arrow repo. The extra overhead on development, release
process, and user issue reporting and PR contributing are too much in
addition to the technical challenges that we never resolved involving
including the past Arrow.jl release version git trees in the apache/arrow
repo.

We're still very much committed to working on the Julia implementation and
participating in the broader arrow community. I've enjoyed following the
user/dev mailing lists and will continue to do so. We monitor format
proposals and try to implement new functionality as quickly as possible. We
got the initial arrow flight proto code generated just last night in fact.
I'd still like to explore official integration with the archery test suite
to solidify the Julia implementation with integration tests; I think that
would be very valuable for long-term confidence in the cross-language
support of the Julia implementation.

We realize one of the main implications will probably be dropping Julia
from the list of "official implementations". We're encouraged by the many
users who have already started using the Julia implementation and will
strive to maintain a high rate of issue responsiveness and feature
development to maintain project confidence. If there's a possibility of
being included somewhere as an "unofficial" or "semi-official"
implementation, we'd love to still be bundled with the broader arrow
project somehow, like, for example, showing how Julia integrates with the
archery test suite, once the work there is done.

Best,

-Jacob



On Tue, Mar 30, 2021 at 4:10 PM Wes McKinney  wrote:

> Also, on the issue that there are no Julia-focused PMC members — note
> that I helped the JavaScript folks make their own independent releases
> for quite a while: called the votes (e.g. [1]), helped get people to
> verify and vote on the releases. After a time, it was decided to stop
> releasing independently because there wasn't enough development
> activity to justify it.
>
> [1]: https://www.mail-archive.com/dev@arrow.apache.org/msg05971.html
>
> On Tue, Mar 30, 2021 at 4:54 PM Wes McKinney  wrote:
> >
> > hi Jacob,
> >
> > On Tue, Mar 30, 2021 at 4:18 PM Jacob Quinn 
> wrote:
> > >
> > > I can comment as the primary apache arrow liaison for the Arrow.jl
> > > repository and original code donator.
> > >
> > > I apologize for the "surprise", but I commented a few times in various
> > > places and put a snippet in the README
> > > <
> https://github.com/apache/arrow/tree/master/julia/Arrow#difference-between-this-code-and-the-juliadataarrowjl-repository
> >
> > > about
> > > the approach I wanted to take w/ the Julia implementation in terms of
> > > keeping the JuliaData/Arrow.jl repository as a "dev branch" of sorts
> of the
> > > apache/arrow code, upstreaming changes periodically. There's even a
> script
> > > <
> https://github.com/JuliaData/Arrow.jl/blob/main/scripts/update_apache_arrow_code.jl
> >
> > > I wrote to mostly automate this upstreaming. I realize now that I
> didn't
> > > consider the "Arrow PMC" position on this kind of setup or seek to
> affirm
> > > that it would be ok to approach things like this.
> > >
> > > The reality is that Julia users are very engrained to expect Julia
> packages
> > > to live in a single stand-alone github repo, where issues can be
> opened,
> > > and pull requests are welcome. It was hard and still is hard to imagine
> > > "turning that off", since I believe we would lose a lot of valuable bug
> > > reports and first-time contributions. This isn't necessarily any fault
> of
> > > how the bug report/contribution process is handled for the arrow
> project
> > > overall, though I'm also aware that there's a desire to make it easier
> > >
> > >
> > <
> https://lists.apache.org/x/thread.html/r8817dfba08ef8daa210956db69d513fd27b7a751d28fb8f27e39cc7e@%3Cdev.arrow.apache.org%3E
> >
> > > and
> > > it currently requires more and different effort than Julia users are
> used
> > > to. I think it's more from how open, welcoming, and how strong the
> culture
> > > is in Julia around encouraging community contributions and the tight
> > > integration with github and its open-source project management tools.
> > >
> >
> > Well, we are on track to having 1000 different people contribute to
> > the project and have over 12,000 issues, so I don't think there is
> > evid

Re: Status of Arrow Julia implementation?

2021-03-30 Thread Jacob Quinn
I can comment as the primary apache arrow liaison for the Arrow.jl
repository and original code donator.

I apologize for the "surprise", but I commented a few times in various
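places and put a snippet in the README
<https://github.com/apache/arrow/tree/master/julia/Arrow#difference-between-this-code-and-the-juliadataarrowjl-repository>
about
the approach I wanted to take w/ the Julia implementation in terms of
keeping the JuliaData/Arrow.jl repository as a "dev branch" of sorts of the
apache/arrow code, upstreaming changes periodically. There's even a script
<https://github.com/JuliaData/Arrow.jl/blob/main/scripts/update_apache_arrow_code.jl>
I wrote to mostly automate this upstreaming. I realize now that I didn't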
consider the "Arrow PMC" position on this kind of setup or seek to affirm
that it would be ok to approach things like this.

The reality is that Julia users are very engrained to expect Julia packages
to live in a single stand-alone github repo, where issues can be opened,
and pull requests are welcome. It was hard and still is hard to imagine
"turning that off", since I believe we would lose a lot of valuable bug
reports and first-time contributions. This isn't necessarily any fault of
how the bug report/contribution process is handled for the arrow project
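overall, though I'm also aware that there's a desire to make it easier
<https://lists.apache.org/x/thread.html/r8817dfba08ef8daa210956db69d513fd27b7a751d28fb8f27e39cc7e@%3Cdev.arrow.apache.org%3E>
and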
it currently requires more and different effort than Julia users are used
to. I think it's more from how open, welcoming, and how strong the culture
is in Julia around encouraging community contributions and the tight
integration with github and its open-source project management tools.

Additionally, I was and still am concerned about the overall release
process of the apache/arrow project. I know there have been efforts there
as well to make it easier for individual languages to release on their own
cadence, but just anecdotally, the JuliaData/Arrow.jl has had/needed/wanted
10 patch and minor releases since the original code donation, whereas the
apache/arrow project has had one (3.0.0). This leads to some of the
concerns I have with restricting development to just the apache/arrow
repository: how exactly does the release process work for individual
languages who may desire independent releases apart from the quarterly
overall project releases? I think from the Rust thread I remember that you
just need a group of language contributors to all agree, but what if I'm
the only "active" Julia contributor? It's also unclear what the
expectations are for actual development: with the original code donation
PRs, I know Neal "reviewed" the PRs, but perhaps missed the details around
how I proposed development continue going forward. Is it required to have a
certain number of reviews before merging? On the Julia side, I can try to
encourage/push for those who have contributed to the JuliaData/Arrow.jl
repository to help review PRs to apache/arrow, but I also can't guarantee
we would always have someone to review. It just feels pretty awkward if I
keep needing to ping non-Julia people to "review" a PR to merge it. Perhaps
this is just a problem of the overall Julia implementation "smallness" in
terms of contributors, but I'm not sure on the best answer here.

So in short, I'm not sure on the best path forward. I think strictly
restricting development to the apache/arrow physical repository would
actively hurt the progress of the Julia implementation, whereas it *has*
been progressing with increasing momentum since first released. There are
posts on the Julia discourse forum, in the Julia slack and zulip
communities, and quite a few issues/PRs being opened at the
JuliaData/Arrow.jl repository. There have been several calls for arrow
flight support, with a member from Julia Computing actually close to
releasing a gRPC client specifically
to help with flight support. But in terms of actual committers, it's been
primarily just myself, with a few minor contributions by others.

I guess the big question that comes to mind is what are the hard
requirements to be considered an "official implementation"? Does the code
*have* to live in the same physical repo? Or if it passed the series of
archery integration tests, would that be enough? I apologize for my
naivete/inexperience on all things "apache", but I imagine that's a big
part of it: having official development/releases through the apache/arrow
community, though again I'm not exactly sure on the formal processes here?
I would like to keep Julia as an official implementation, but I'm also
mostly carrying the maintainership alone at the moment and want to be
realistic with the future of the project.

I'm open to discussion and ideas on the best way forward.

-Jacob

On Tue, Mar 30, 2021 at 2:03 PM Wes McKinney  wrote:

> hi folks,
>
> I was very surprised today 

RE: Re: sparse data array

2021-03-30 Thread Jacob Quinn
>
> > On a related note, such encoding would address DataFusion's issue of
> > representing scalars / constant arrays: a constant array would be
> > represented as a repetition. Currently we just unpack (i.e. allocate) a
> > constant array when we want to transfer through a RecordBatch.
>

In the Julia implementation, we recently merged support
for more flexible usage of
extension types. One use-case that came up was representing the `nothing`
Julia value, which is often referred to as the "software engineer's null"
as opposed to the `missing` value, which is a propagating "data" null
value. Via extension types, we allow treating `nothing` as a "NullKind",
which means it is serialized as a null vector, with the extension type
"JuliaLang.Nothing", which allows correctly deserializing the null vector
as a `Vector{Nothing}` when reading (well, technically a
`NullVector{Nothing}`, since it's a custom array type, but hopefully you
get the point).

Anyway, all that to say that this isn't quite constant arrays, but pretty
close. You encode the constant/value in the extension type metadata. This
probably isn't a very satisfying approach for intra-language convention,
however, since I know extension types are more in the "metadata" realm
semantically.

Perhaps some generalization of the `Null` type though could be good in the
future; like a `Constant` type that has a field for the value, and a field
for the type. Then the `NullVector` encoding would be used where we just
encode the length, and no actual buffers are required to be
serialized/deserialized.
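
As a rough illustration of that idea, here is a small Python sketch. The
ConstantArray class and its fields are hypothetical and do not exist in any
Arrow implementation; it only shows that a value, a type, and a length are
enough to answer element access without materializing any buffers:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstantArray:
    """Hypothetical constant array: one value repeated `length` times."""
    value: object
    type_name: str
    length: int

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        # Every valid index yields the same value; nothing is stored per slot.
        if not -self.length <= i < self.length:
            raise IndexError(i)
        return self.value

    def buffers(self):
        # Like a null vector, only metadata would need to be serialized.
        return []


arr = ConstantArray(value=7, type_name="int64", length=3)
```

Here list(arr) expands to three copies of the value only when a consumer
actually asks for them, which is the allocation the DataFusion comment above
would like to avoid.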

-Jacob


Re: No replacement dictionaries supported in pyarrow?

2021-03-18 Thread Jacob Quinn
Ah, interesting. So to make sure I understand correctly, the C++ write
implementation will scan all "batches" and unify all dictionary values
before writing out the schema + dictionary messages? But only when writing
the file format? In the streaming case, it would still write
replacement/delta dictionary messages as needed.
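
If that reading is right, the unification step might look roughly like the
Python sketch below. This is only an illustration of the idea, not how the
C++ writer is actually implemented; unify_dictionaries and the
(dictionary, indices) batch representation are assumptions made for the
example:

```python
def unify_dictionaries(batches):
    """Merge per-batch dictionaries into one and remap the indices.

    `batches` is a list of (dictionary, indices) pairs, where `indices`
    refer to positions in that batch's own dictionary.
    """
    unified = []    # the single dictionary written up front
    position = {}   # value -> index in the unified dictionary
    remapped = []
    for dictionary, indices in batches:
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
        remapped.append([position[dictionary[i]] for i in indices])
    return unified, remapped


batches = [
    (["a", "b"], [0, 1, 0]),  # decodes to a, b, a
    (["b", "c"], [1, 0]),     # decodes to c, b
]
unified, remapped = unify_dictionaries(batches)
```

With one unified dictionary, no batch needs a replacement dictionary, at the
cost of scanning every batch before the first dictionary message can be
written.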

-Jacob

On Thu, Mar 18, 2021 at 9:10 AM Neal Richardson 
wrote:

> Somewhat related issue: https://issues.apache.org/jira/browse/ARROW-10406
>
> On Wed, Mar 17, 2021 at 11:22 PM Micah Kornfield 
> wrote:
>
> > BTW, this nuance always felt a little strange to me, but would have
> > required adding additional information to the file format, to
> disambiguate
> > when exactly a dictionary was intended to be replaced.
> >
> > On Wed, Mar 17, 2021 at 11:19 PM Micah Kornfield 
> > wrote:
> >
> > > Hi Jacob,
> > > There is nuance.  The file format does not support dictionary
> > replacement,
> > > the specification [1] explains why that is currently the case.  Only the "stream
> > > format" supports replacement (i.e. no magic number, only schema
> followed
> > by
> > > one or more dictionary/record-batch messages).
> > >
> > > -Micah
> > >
> > > [1] https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> > >
> > > On Wed, Mar 17, 2021 at 11:04 PM Jacob Quinn 
> > > wrote:
> > >
> > >> Had an issue come up here:
> > >>
> https://github.com/JuliaData/Arrow.jl/issues/129#issuecomment-777350450
> > .
> > >> From the implementation status page, it says C++ supports replacement
> > >> dictionaries and that python tracks the C++ implementation. Is this
> > just a
> > >> pyarrow issue where it specifically doesn't support replacement
> > >> dictionaries? Or it's not "hooked in" properly?
> > >>
> > >> -Jacob
> > >>
> > >
> >
>


No replacement dictionaries supported in pyarrow?

2021-03-18 Thread Jacob Quinn
Had an issue come up here:
https://github.com/JuliaData/Arrow.jl/issues/129#issuecomment-777350450.
From the implementation status page, it says C++ supports replacement
dictionaries and that python tracks the C++ implementation. Is this just a
pyarrow issue where it specifically doesn't support replacement
dictionaries? Or it's not "hooked in" properly?

-Jacob


Re: Constraints on fixed size list of variables sized types

2021-02-22 Thread Jacob Quinn
Yeah, I didn't quite follow the example either; it seems like your example
actually corresponds to a FixedSizeList<FixedSizeList<Binary>[2]>[2]? Or
perhaps FixedSizeList>[2]? Assuming the former, it seems you'd
need additional fixed size slots to account for the Null element. In Julia,
you can inspect the internal structure of this like:

julia> c = [missing, ( ([0x00], [0x01, 0x02]), ([0x03, 0x04], [0x05]))]
2-element Vector{Union{Missing, Tuple{Tuple{Vector{UInt8}, Vector{UInt8}},
Tuple{Vector{UInt8}, Vector{UInt8}}}}}:
 missing
 ((UInt8[0x00], UInt8[0x01, 0x02]), (UInt8[0x03, 0x04], UInt8[0x05]))

julia> ac = Arrow.toarrowvector(c)
2-element Arrow.FixedSizeList{Union{Missing, Tuple{Tuple{Vector{UInt8},
Vector{UInt8}}, Tuple{Vector{UInt8}, Vector{UInt8}}}},
Arrow.FixedSizeList{Tuple{Vector{UInt8}, Vector{UInt8}},
Arrow.List{Vector{UInt8}, Int32, Arrow.ToList{UInt8, false, Vector{UInt8},
Int32}}}}:
 missing
 ((UInt8[0x00], UInt8[0x01, 0x02]), (UInt8[0x03, 0x04], UInt8[0x05]))

# binary list data
julia> ac.data.data.data
10-element Arrow.ToList{UInt8, false, Vector{UInt8}, Int32}:
 0x00
 0x00
 0x00
 0x00
 0x00
 0x01
 0x02
 0x03
 0x04
 0x05

# binary list offsets
julia> ac.data.data.offsets
8-element Arrow.Offsets{Int32}:
 (1, 1)
 (2, 2)
 (3, 3)
 (4, 4)
 (5, 5)
 (6, 7)
 (8, 9)
 (10, 10)
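
For comparison, here is a small Python sketch of the two offset layouts from
the quoted message below. It assumes one reading of that example: three slots
of a fixed-size list of 2 binaries, with the first slot null. The function
names are made up for illustration:

```python
def binary_at(offsets, data, j):
    """Return the j-th binary value of the child array."""
    return data[offsets[j]:offsets[j + 1]]


def get_padded(validity, offsets, data, i, n=2):
    """Null slots still occupy n child slots, so slot i is at [i*n, (i+1)*n)."""
    if not validity[i]:
        return None
    return [binary_at(offsets, data, j) for j in range(i * n, (i + 1) * n)]


def get_dense(validity, offsets, data, i, n=2):
    """Null slots occupy no child slots, so preceding nulls must be counted."""
    if not validity[i]:
        return None
    nulls = validity[:i].count(False)
    start = (i - nulls) * n
    return [binary_at(offsets, data, j) for j in range(start, start + n)]


data = bytes([0, 1, 2, 3, 4, 5])
validity = [False, True, True]
padded_offsets = [0, 0, 0, 1, 3, 5, 6]  # "option 2": child has 6 slots
dense_offsets = [0, 1, 3, 5, 6]         # "option 1": child has 4 slots
```

Both layouts give back the same values; the padded one just avoids the null
count on every access, which seems to be the "sliceability" property being
asked about.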

On Sun, Feb 21, 2021 at 1:38 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> We state in the spec that:
>
> A fixed size list type is specified like FixedSizeList<T>[N], where T is
> > any type (*primitive or nested*) and N is a 32-bit signed integer
> > representing the length of the lists.
> >
>
> (emphasis mine)
>
> Now, suppose that we have FixedSizeList<Binary>[2], i.e. a fixed type whose
> inner is a variable sized type, as follows
>
> [
> Null,
> [
> [[0], [1, 2]],
> [[3, 4], [5]],
> ]
> ]
>
> Looking at the offsets of the binary, two options seem possible according
> to the spec:
>
> 1. [0, 1, 3, 5, 6]  (i.e. inner has len = 4)
> 2. [0, 0, 0, 1, 3, 5, 6]  (i.e. inner has len = 6)
>
> The difference in behavior emerges whenever we want to access the values of
> the i'th slot of the fixed list, e.g. [ [[0], [1, 2]], [[3, 4], [5]] ]
> above.
>
> With option 1, we can't slice the inner using `[i * 2, (i + 1) * 2]`: for i
> = 1 this would correspond to the offsets `[3, 5, 6, out of bounds]` (the
> result would still be wrong if this was in bounds, as it excluded the
> `[[0], [1, 2]]`). In this case, we need to count the number of nulls,
> `nulls`, up to `i` and take `[(i - nulls) * 2, (i - nulls + 1) * 2]`.
>
> If we use option 2, we can slice the binary directly using `[i * 2, (i + 1)
> * 2]`: for i = 1, this would correspond to the offsets `[0, 1, 3, 5, 6]`,
> which is correct.
>
> The challenge here is that there is no way to tell whether the inner array
> fulfills this "sliceability" constraint or not. I can't find this
> constraint in the spec. Do we enforce it somewhere? Note that this behavior
> only affects FixedSizeList, but it does affect all variations whose inner
> has a variable size (List, Binary, Utf8, etc).
>
> Any ideas?
>
> Best,
> Jorge
>


Re: [ANNOUNCE] Apache Arrow 3.0.0 released

2021-01-27 Thread Jacob Quinn
Can we make sure Julia gets added to the language list in the future? ;)

On Tue, Jan 26, 2021 at 6:45 AM Krisztián Szűcs  wrote:

> The Apache Arrow community is pleased to announce the 3.0.0 release.
> The release includes 678 resolved issues ([1]) since the 2.0.0 release.
>
> The release is available now from our website, [2] and [3]:
> https://arrow.apache.org/install/
>
> Release notes are available at:
> https://arrow.apache.org/release/3.0.0.html
>
> What is Apache Arrow?
> ---------------------
>
> Apache Arrow is a cross-language development platform for in-memory data.
> It
> specifies a standardized language-independent columnar memory format for
> flat
> and hierarchical data, organized for efficient analytic operations on
> modern
> hardware. It also provides computational libraries and zero-copy streaming
> messaging and interprocess communication. Languages currently supported
> include
> C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
>
> Please report any feedback to the mailing lists ([4])
>
> Regards,
> The Apache Arrow community
>
> [1]: https://issues.apache.org/jira/projects/ARROW/versions/12348823
> [2]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-3.0.0/
> [3]: https://bintray.com/apache/arrow
> [4]: https://lists.apache.org/list.html?dev@arrow.apache.org
>


Re: [VOTE] Release Apache Arrow 3.0.0 - RC0

2021-01-16 Thread Jacob Quinn
I found a small issue with the Julia installation instructions; a PR to fix
is here: https://github.com/apache/arrow/pull/9226. With that change, the
Julia package can be installed and tests pass for me locally.

-Jacob

On Fri, Jan 15, 2021 at 3:53 PM Krisztián Szűcs  wrote:

> Hi,
>
> I would like to propose the following release candidate (RC0) of Apache
> Arrow version 3.0.0. This is a release consisting of 641
> resolved JIRA issues[1].
>
> This release candidate is based on commit:
> 0c419fcd2529d65b0cb902368ecba46a50db1622 [2]
>
> The source release rc0 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7].
> The changelog is located at [8].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [9] for how to validate a release candidate.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 3.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 3.0.0 because...
>
> [1]:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%203.0.0
> [2]:
> https://github.com/apache/arrow/tree/0c419fcd2529d65b0cb902368ecba46a50db1622
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-3.0.0-rc0
> [4]: https://bintray.com/apache/arrow/centos-rc/3.0.0-rc0
> [5]: https://bintray.com/apache/arrow/debian-rc/3.0.0-rc0
> [6]: https://bintray.com/apache/arrow/python-rc/3.0.0-rc0
> [7]: https://bintray.com/apache/arrow/ubuntu-rc/3.0.0-rc0
> [8]:
> https://github.com/apache/arrow/blob/0c419fcd2529d65b0cb902368ecba46a50db1622/CHANGELOG.md
> [9]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>


Re: Julia package

2021-01-12 Thread Jacob Quinn
Hi Krisztián,

I explained a little bit the setup [here](
https://github.com/apache/arrow/pull/9121#discussion_r554149673) recently.
We're still in a transition from the JuliaData repo to apache/arrow in
terms of development (traditionally Julia packages are their own github
repos, so users have been trained to open issues/PRs to the repo directly).
So far this has remained the easier path due to recent CI issues w/
apache/arrow and the overall low Julia traffic/review time on apache/arrow
PRs, but I think it's something we can work on.

For now, I think the easiest release for Julia is updating the installation
instructions to be:

using Pkg; Pkg.add(url="https://github.com/apache/arrow",
                   subdir="julia/Arrow.jl", rev="apache-arrow-3.0.0")

Which was coincidentally just merged this morning. That will allow users to
checkout the exact release commit for the julia code in the apache/arrow
repo. And for the meantime, the rate of bugfixes has still been pretty
high, so we've been doing our own bugfix patch releases from the JuliaData
repo.
If people have ideas on how to make the process smoother, I'm open; this
seems to work for the moment, as we continue working on integrating Julia.

-Jacob

On Tue, Jan 12, 2021 at 5:20 AM Krisztián Szűcs 
wrote:

> Hi,
>
> With the upcoming release and the new julia implementation, shall we
> consider to ship the julia package from apache/arrow with the 3.0
> release?
> I'm also a bit confused since I can see some activity in the original
> repository [1], but not in the arrow repository.
>
> Thanks, Krisztian
>
> [1]: https://github.com/JuliaData/Arrow.jl/commits/main
>


Re: Github Actions feedback time

2021-01-06 Thread Jacob Quinn
From this page, it looks like there have been certain github organizations
that have been "whitelisted" to allow their github actions to run. Is there
a process to do this whitelisting? If the `julia-actions` github org was
allowed to run, that would enable everything needed for Julia CI to run.

-Jacob

On Wed, Jan 6, 2021 at 10:00 PM Sutou Kouhei  wrote:

> Hi,
>
> > I wasn't following the build queue's state lately, but I think we
> > should consolidate the build configurations.
> > Possible candidates are the PR* workflows
>
> https://github.com/apache/arrow/pull/9120
>
>
> Thanks,
> --
> kou
>
> In 
>   "Github Actions feedback time" on Tue, 5 Jan 2021 13:33:38 +0100,
>   Krisztián Szűcs  wrote:
>
> > Hi,
> >
> > I'm concerned about the overall feedback time we have on pull requests.
> > I have a simple PR to make the comment bot working again, but no
> > builds are running even after 30 minutes.
> > I can see 2-4 running builds, which will make our work much harder
> > right before the release.
> >
> > I wasn't following the build queue's state lately, but I think we
> > should consolidate the build configurations.
> > Possible candidates are the PR* workflows and good to have tests which
> > we could trigger on master instead.
> >
> > Opinions?
> >
> > Regards, Krisztian
>


Re: Dictionary key access in python/generally

2020-10-07 Thread Jacob Quinn
>
> But I'm also attaching table
> metadata to each feather, which I'd hate to lose.
>

Note the arrow format allows attaching custom metadata at the column
(field), schema, and message level, so it should be possible to retain any
metadata this way.

-Jacob

On Wed, Oct 7, 2020 at 11:38 AM Benjamin MacDonald Schmidt <
bmschm...@gmail.com> wrote:

> Hello,
>
> Exciting project, thanks for all your work. I gather it's appropriate to
> ask a use question here? Assuming so:
>
> I have a web application that serves portions of a dataset I've broken into
> a few thousand featherV2 files structured as a quadtree. The structure
> makes heavy use of text dictionary types; I'd like to have each dictionary
> integer map to the same string across all files so that I can ship the data
> for each tile straight to GPU without decoding the text.
>
> If you slice a portion of a pandas categorical array and coerce to an arrow
> dictionary, you keep the underlying pandas integer encoding; for example,
> the last line here shows a dictionary with four keys even though the table
> has just one row.
>
> ```
> import pandas as pd
> import pyarrow as pa
> pandas_cat = pd.Series(["A", "B", "C", "B", "F"], dtype = "category")
> pa.Array.from_pandas(pandas_cat[2:3])
> ```
>
> For my purposes, this is good! But of course it's wasteful, too. So I'm
> wondering:
>
> 1. Whether it's safe to count on the above code continuing to use the
> internal pandas keys in the arrow output, or whether at some point it might
> redo the pandas encoding in a more efficient way;
> 2. Whether there's a native pyarrow way to ensure that multiple feather
> dictionaries across files use the same integer identifiers for all the keys
> that they share.
>
> I can see that the right way here might be to use the IPC streaming format
> rather than feather, and send out a single schema for the dataset, with
> dictionary batches identifying the keys. But I'm also attaching table
> metadata to each feather, which I'd hate to lose.
>
> --
> Benjamin Schmidt
> Director of Digital Humanities and Clinical Associate Professor of History
> 20 Cooper Square, Room 538
> New York University
>
> 
> benschmidt.org
>
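
One way to get the property asked about in question 2 (the same integer for the same string in every file) is to build a single key table up front and encode every tile against it. The following is only a library-agnostic sketch in plain Python; `build_global_dictionary`, `encode_tile`, and the tile names are hypothetical, and it makes no claim about what pyarrow does internally.

```python
from typing import Dict, Iterable, List


def build_global_dictionary(tiles: Iterable[List[str]]) -> Dict[str, int]:
    """Assign every distinct string one integer code, in first-seen order."""
    codes: Dict[str, int] = {}
    for tile in tiles:
        for value in tile:
            # setdefault keeps the existing code if the string was seen before
            codes.setdefault(value, len(codes))
    return codes


def encode_tile(tile: List[str], codes: Dict[str, int]) -> List[int]:
    """Replace each string with its code from the shared table."""
    return [codes[value] for value in tile]


# Hypothetical tiles standing in for the per-file text columns.
tiles = {"tile_0": ["A", "B", "C"], "tile_1": ["B", "F"], "tile_2": ["F", "A"]}
codes = build_global_dictionary(tiles.values())
encoded = {name: encode_tile(values, codes) for name, values in tiles.items()}
```

Because every tile is encoded against the same `codes` table, the integer for "F" is identical in `tile_1` and `tile_2`, at the cost of each file carrying (or referencing) the full key list.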


Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Jacob Quinn
My immediate thought reading the discussion points was Julia's task-based
multithreading model that has been part of the language for over a year
now. An announcement blogpost for Julia 1.3 laid out some of the details
and high-level approach: https://julialang.org/blog/2019/07/multithreading/,
and the multithreading code was marked stable in the recent 1.5 release.

Kiran, one of the main contributors to the threading model in Julia, worked
on a separate C-based repo for the core functionality (
https://github.com/kpamnany/partr), but I think the latest code is embedded
in the Julia source code now.

Anyway, probably most useful as a reference, but Jameson (cc'd) also does
weekly multithreading chats (on Wednesdays), so I imagine he wouldn't mind
chatting about things if desired.

-Jacob

On Tue, Sep 15, 2020 at 8:17 PM Weston Pace  wrote:

> My C++ is pretty rusty but I'll see if I can come up with a concrete
> CSV example / experiment / proof of concept on Friday when I have a
> break from work.
>
> On Tue, Sep 15, 2020 at 3:47 PM Wes McKinney  wrote:
> >
> > On Tue, Sep 15, 2020 at 7:54 PM Weston Pace wrote:
> > >
> > > Yes.  Thank you.  I am in agreement with you and futures/callbacks are
> > > one such "richer programming model for
> > > hierarchical work scheduling".
> > >
> > > A scan task with a naive approach is:
> > >
> > > workers = partition_files_list(files_list)
> > > for worker in workers:
> > >     start_thread(worker)
> > > for worker in workers:
> > >     join_thread(worker)
> > > return aggregate_results()
> > >
> > > You have N+1 threads because you have N worker threads and 1 scan
> > > thread.  There is the potential for deadlock if your thread pool only
> > > has one remaining spot and it is given to the scan thread.
> > >
> > > On the other hand, with a futures based approach you have:
> > >
> > > futures = partition_files_list(files_list)
> > > return when_all(futures).do(aggregate_results)
> > >
> > > There are only N threads.  The scan thread goes away.  In fact, if all
> > > of your underlying OS/FS libraries are non-blocking then you can
> > > completely eliminate threads in the waiting state and an entire
> > > category of deadlocks is no longer a possibility.
> >
> > I don't quite follow. I think it would be most helpful to focus on a
> > concrete practical matter like reading Parquet or CSV files in
> > parallel (which can go faster through parallelism at the single
> > file level) and devise a programming model in C++ that is different
> > from what we are currently doing that results in superior CPU
> > utilization.
> >
> >
> > >
> > > -Weston
> > >
> > > On Tue, Sep 15, 2020 at 1:21 PM Wes McKinney wrote:
> > > >
> > > > hi Weston,
> > > >
> > > > We've discussed some of these problems in the past -- I was
> > > > enumerating some of these issues to highlight the problems that are
> > > > resulting from an absence of a richer programming model for
> > > > hierarchical work scheduling. Parallel tasks originating in each
> > > > workload are submitted to a global thread pool where they are
> > > > commingled with the tasks coming from other workloads.
> > > >
> > > > As an example of how this can go wrong, suppose we have a static
> > > > thread pool with 4 executors. If we submit 4 long-running tasks to
> > > > the pool, and then each of these tasks spawns additional tasks that
> > > > go into the thread pool, a deadlock can occur, because the thread
> > > > pool thinks that it's executing tasks when in fact those tasks are
> > > > waiting on their dependent tasks to complete.
> > > >
> > > > A similar resource underutilization occurs when we do
> > > > pool->Submit(ReadFile), where ReadFile needs to do some IO -- from
> > > > the thread pool's perspective, the task is "working" even though it
> > > > may wait for one or more IO calls to complete.
> > > >
> > > > In the Datasets API in C++ we have both of these problems: file scan
> > > > tasks are being pushed onto the global thread pool, and so to prevent
> > > > deadlocks multithreaded file parsing has been disabled. Additionally,
> > > > the scan tasks do IO, resulting in suboptimal performance (the
> > > > problems caused by this will be especially exacerbated when running
> > > > against slower filesystems like Amazon S3)
> > > >
> > > > Hopefully the issues are more clear.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > > On Tue, Sep 15, 2020 at 2:57 PM Weston Pace wrote:
> > > > >
> > > > > It sounds like you are describing two problems.
> > > > >
> > > > > > 1) Idleness - Tasks are holding threads in the thread pool while
> > > > > > they wait for IO or some long running non-CPU task to complete.
> > > > > > These threads are often in a "wait" state or something similar.
> > > > > > 2) Fairness - The ordering of tasks is causing short tasks that
> > > > > > could be completed quickly to get stuck behind longer-running
> > > > > > tasks.
> > > > > Fairness can be an issue even if all 
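
The futures-based scan quoted in this thread (`when_all(futures).do(aggregate_results)`) can be illustrated with a small Python sketch. This is not the Arrow C++ design; `when_all` and `scan_file` are hypothetical helpers built on `concurrent.futures`, shown only to make the "no dedicated scan thread inside the pool" idea concrete.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


def when_all(futures: List[Future]) -> Future:
    """Return a future that completes with all results once every input is done."""
    combined: Future = Future()
    remaining = len(futures)
    lock = threading.Lock()

    def on_done(_finished: Future) -> None:
        nonlocal remaining
        with lock:
            remaining -= 1
            is_last = remaining == 0
        if is_last:
            try:
                # Every input is finished here, so result() does not block.
                combined.set_result([f.result() for f in futures])
            except Exception as exc:
                combined.set_exception(exc)

    if not futures:
        combined.set_result([])
    for future in futures:
        future.add_done_callback(on_done)
    return combined


def scan_file(path: str) -> int:
    """Stand-in for parsing one file; returns a fake row count."""
    return len(path)


files = ["a.csv", "bb.csv", "ccc.csv"]
with ThreadPoolExecutor(max_workers=2) as pool:
    scans = [pool.submit(scan_file, path) for path in files]
    # No pool worker is parked waiting on the others; only the caller waits.
    row_counts = when_all(scans).result()
total_rows = sum(row_counts)
```

The aggregation runs from a completion callback rather than from a task that occupies a pool slot while joining its children, which is the situation the deadlock example above describes.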

Re: Compression?

2020-09-15 Thread Jacob Quinn
Ah, that's where it was.

Ok, so if I understand correctly, individual buffers are compressed, and in
the Buffer struct, the buffer length is the _compressed_ length? And when
written, the _uncompressed_ length is first written in 8 bytes, then the
compressed buffer?

What's the general strategy for dealing with compressed buffers? Uncompress
the whole thing when deserializing a compressed buffer? Or is decompressing
delayed until individual elements are accessed? I'm guessing the former
since it doesn't seem like you'd be able to do random-access into a
compressed buffer?

-Jacob
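
The framing described above (an 8-byte uncompressed length followed by the compressed bytes) can be sketched as follows. This is a minimal illustration under stated assumptions: `zlib` from the Python standard library stands in for LZ4/ZSTD, the length is written as a little-endian signed 64-bit integer, and `frame_buffer`/`unframe_buffer` are hypothetical names. The sketch decompresses the whole buffer eagerly, which is one possible strategy, not a statement of what any Arrow implementation does.

```python
import struct
import zlib


def frame_buffer(raw: bytes) -> bytes:
    """Write the uncompressed length (int64, little-endian), then the compressed body."""
    # zlib is only a stand-in codec here; it is not one of the codecs named in the thread.
    return struct.pack("<q", len(raw)) + zlib.compress(raw)


def unframe_buffer(framed: bytes) -> bytes:
    """Read the 8-byte length prefix, then decompress the remainder in one go."""
    (uncompressed_length,) = struct.unpack_from("<q", framed, 0)
    body = zlib.decompress(framed[8:])
    if len(body) != uncompressed_length:
        raise ValueError("length prefix does not match decompressed size")
    return body


raw = bytes(range(256)) * 4
framed = frame_buffer(raw)
restored = unframe_buffer(framed)
```

Knowing the uncompressed length up front lets a reader allocate the output buffer before decompressing, which is presumably why the prefix is there.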

On Tue, Sep 15, 2020 at 6:23 PM Wes McKinney  wrote:

> We have protocol-level compression for message body buffers [1][2]
> with LZ4 or ZSTD
>
> In-memory compression and encoding other than dictionary encoding
> (like RLE) has been discussed multiple times and remains on the
> roadmap for the project.
>
> [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L45
>
> On Tue, Sep 15, 2020 at 7:18 PM Jacob Quinn wrote:
> >
> > Am I correct in understanding there's nothing in the arrow ipc/file
> > format spec about compression? I thought I had seen something at one
> > point, but looking over the spec website, I don't see anything.
> >
> > -Jacob
>


Compression?

2020-09-15 Thread Jacob Quinn
Am I correct in understanding there's nothing in the arrow ipc/file format
spec about compression? I thought I had seen something at one point, but
looking over the spec website, I don't see anything.

-Jacob


Julia implementation and integration with main apache arrow repository

2020-09-13 Thread Jacob Quinn
Hello all,

Hopefully this email works (I'm not super familiar with using mailing lists
like this).

Over the past few weeks, I've been working on a pure Julia implementation
to support serializing/deserializing the arrow format for Julia. The code
in its current state can be found here:
https://github.com/JuliaData/Arrow.jl.

I believe the code has reached an initial beta-level quality, and I just
finished writing the arrow <-> json integration testing code that archery
expects. I haven't worked on actual archery integration yet, but it should
just be a matter of adding a tester_julia.py file that knows how to invoke
the test/integrationtest.jl file with similar arguments as the tester_go.py
file.

This email has a couple purposes:
* Signal that the julia code is somewhat ready to be used/integrated in the
main repo
* Ask for advice/direction on actually integrating with the apache arrow
github repository

For the latter, in particular, I imagine keeping an initial PR as minimal
as possible is desirable. I need to follow up with the core pkg devs for
Julia, but I've been told it's possible/not hard to have a Julia package
"live" inside a monorepo, but I just haven't figured out the details of
what that means on the Julia General package registry side of things. But
I'm happy to figure that out, and it shouldn't really affect the merging of
Julia code into the apache arrow github.

So my plan is roughly:
* Fork/make a branch of the apache arrow repo
* Add in the Julia code from the link I mentioned above
* Add necessary files/integration in archery to run Julia integration tests
alongside other languages
* Do initial merge into apache arrow?

If there are other initial requirements core devs would expect, just let me
know, but I imagine that updating the implementation matrix, for example,
can be done afterwards as follow up.

Excited to have Julia more officially integrated here!

Cheers,

-Jacob
https://github.com/quinnj
https://twitter.com/quinn_jacobd


Fwd: How to concatenate RecordBatches into a single RecordBatch?

2018-08-27 Thread Jacob Quinn Shenker
Hi all,

Question: If I have a set of small (10-1000 rows) RecordBatches on
disk or in memory, how can I (efficiently) concatenate/rechunk them
into larger RecordBatches (so that each column is output as a
contiguous array when written to a new Arrow buffer)?

Context: With such small RecordBatches, I'm finding that reading Arrow
into a pandas table is very slow (~100x slower than local disk) from
my cluster's Lustre distributed file system (plenty of bandwidth but
each IO op has very high latency); I'm assuming this has to do with
needing many seek() calls for each RecordBatch. I'm hoping it'll help
if I rechunk my data into larger RecordBatches before writing to disk.
(The input RecordBatches are small because they are the individual
results returned by millions of tasks on a dask cluster, as part of a
streaming analysis pipeline.)

While I'm here I also wanted to thank everyone on this list for all
their work on Arrow! I'm a PhD student in biology at Harvard Medical
School. We take images of about 1 billion individual bacteria every
day with our microscopes, generating about 1 PB/yr in raw data. We're
using this data to search for new kinds of antibiotic drugs. Using way
more data allows us to precisely measure how the bacteria's growth is
affected by the drug candidates, which allows us to find new drugs
that previous screens have missed—and that's why I'm really excited
about Arrow, it's making dealing with these data volumes a lot easier
for us!

~ J
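
The rechunking asked about above can be sketched without any particular library: accumulate the columns of the small batches and emit a larger batch once a target row count is reached. Here a "batch" is just a hypothetical dict mapping column names to lists, and `rechunk` is an illustrative helper, not a pyarrow function. In recent pyarrow versions, `pa.Table.from_batches(...)` followed by `Table.combine_chunks()` is one route that should give contiguous columns, but the sketch below does not depend on it.

```python
from typing import Dict, Iterable, Iterator, List

Batch = Dict[str, list]  # column name -> values; stand-in for a RecordBatch


def rechunk(batches: Iterable[Batch], target_rows: int) -> Iterator[Batch]:
    """Concatenate small batches column by column into batches of >= target_rows."""
    pending: Batch = {}
    rows = 0
    for batch in batches:
        for name, values in batch.items():
            pending.setdefault(name, []).extend(values)
        # All columns in a batch have the same length, so any column gives the row count.
        rows += len(next(iter(batch.values()), []))
        if rows >= target_rows:
            yield pending
            pending, rows = {}, 0
    if rows:
        yield pending  # final, possibly smaller, batch


small: List[Batch] = [{"id": [i], "score": [i * 0.5]} for i in range(10)]
large = list(rechunk(small, target_rows=4))
sizes = [len(batch["id"]) for batch in large]
```

Each emitted batch holds every column as one contiguous list, so writing it out needs one region per column instead of one per original task result.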


Question on Exactness of Arrow Memory Layout

2016-06-01 Thread Jacob Quinn
Having become familiar with the Arrow memory layout, and taking a stab at
an implementation in the Julia language, I've come up with a perhaps naive
question.

A "type" (class) I have defined so far is:

immutable Column{T} <: ArrowColumn{T}
    buffer::Vector{UInt8} # potential reference to mmap
    length::Int32
    null_count::Int32
    # null == 0 == false, not-null == 1 == true; always padded to 64-byte alignments
    nulls::BitVector
    values::Vector{T} # always padded to 64-byte alignments
end


which aims to be an array/column that holds any "primitive" bits type `T`.
Note the exact layout matching with "length", "null_count", "nulls", and
"values".
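
To make the "nulls" field concrete, here is a small Python sketch of building a validity bitmap with the convention from the comment above (1 means not-null, 0 means null) and padding it to a 64-byte multiple. Bits are numbered least-significant first within each byte, as in the Arrow columnar format; `build_validity_bitmap` and `pad_to_alignment` are hypothetical helper names used only for illustration.

```python
from typing import List, Optional, Tuple

ALIGNMENT = 64  # bytes; matches the padding mentioned in the struct comments


def pad_to_alignment(data: bytes, alignment: int = ALIGNMENT) -> bytes:
    """Append zero bytes so the length is a multiple of `alignment`."""
    remainder = len(data) % alignment
    return data if remainder == 0 else data + bytes(alignment - remainder)


def build_validity_bitmap(values: List[Optional[int]]) -> Tuple[bytes, int]:
    """Return (padded bitmap, null_count); bit i is 1 when values[i] is not null."""
    bitmap = bytearray((len(values) + 7) // 8)
    null_count = 0
    for i, value in enumerate(values):
        if value is None:
            null_count += 1
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
    return pad_to_alignment(bytes(bitmap)), null_count


bitmap, null_count = build_validity_bitmap([1, None, 3, 4, None])
```

For the five values above, slots 0, 2, and 3 are valid, so the first byte of the bitmap is `0b00001101` and the remaining 63 bytes are zero padding.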

The additional reference, however, is the "buffer" field, which holds a
reference to a byte buffer. This would be technically optional if the
`nulls` and `values` fields owned their own memory, but there are other
cases where `buffer` would own, for example, memory-mapped bytes that
`nulls` and `values` would be sharing.

My question is if this somehow "violates" the Arrow memory layout by
including this additional `buffer` reference in my class?

It raises a larger question of what exactly the inter-language "API" looks
like. I'm assuming it's not as strict as needing to be able to pass a
pointer to another process that would be able to auto-wrap as its own
Arrow structure; but I think I read somewhere that it IS aiming for some
kind of "memcpy" operation. Any light anyone can shed would be most
welcome; help me know if I'm perhaps over-thinking this at this stage.

-Jacob