Re: [VOTE] Release Apache Arrow 0.14.0 - RC0
+1 (binding) Thanks Kou for adding the missing signatures. * I was able to verify the binaries after the signature fix. The Linux package tests are very nice! * I ran the following source verifications (on linux except where noted) * C++ (Ubuntu 19.04 and Windows, with patch https://github.com/apache/arrow/pull/4770) * Python (UB19.04 / Windows) * Java * JS * Ruby * GLib * Go * Rust * Integration tests with Flight (with minor patch https://github.com/apache/arrow/pull/4775) I only had trouble with C#, and it may be environment specific. On Mon, Jul 1, 2019 at 4:32 PM Sutou Kouhei wrote: > > Hi, > > > but it failed with > > > > https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a > > Thanks for catching this. > I failed to upload some files. I uploaded missing files. > > I confirmed that there are no missing files with the > following Ruby script: > > -- > #!/usr/bin/env ruby > > require "open-uri" > require "json" > require "English" > > ["debian", "ubuntu", "centos", "python"].each do |target| > json_path = "/tmp/#{target}-file-list.json" > unless File.exist?(json_path) > > open("https://bintray.com/api/v1/packages/apache/arrow/#{target}-rc/versions/0.14.0-rc0/files";) > do |input| > File.open(json_path, "w") do |json| > IO.copy_stream(input, json) > end > end > end > > source_paths = [] > asc_paths = [] > sha512_paths = [] > JSON.parse(File.read(json_path)).each do |entry| > path = entry["path"] > case path > when /\.asc\z/ > asc_paths << $PREMATCH > when /\.sha512\z/ > sha512_paths << $PREMATCH > else > source_paths << path > end > end > pp([:no_asc, source_paths - asc_paths]) > pp([:no_source_for_asc, asc_paths - source_paths]) > pp([:no_sha512, source_paths - sha512_paths]) > pp([:no_source_for_sha512, sha512_paths - source_paths]) > end > -- > > But this is a bit strange. Download file list is read from > Bintray (*). So I think that our verification script doesn't > try downloading nonexistent files... > > (*) > https://bintray.com/api/v1/packages/apache/arrow/debian-rc/versions/0.14.0-rc0/files > > > I'm going to work on verifying more components. C# is failing with > > > > https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f > > I couldn't reproduce this on my environment. > I'll try this with clean environment. > > Note that we can try only C# verification with the following > command line: > > TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1 > dev/release/verify-release-candidate.sh source 0.14.0 0 > > > Seems like we might need to find an > > artifact staging solution that is not Bintray if API rate limits are > > going to be a problem. > > I don't get response yet from https://bintray.com/apache > organization. I'll open an issue on INFRA JIRA. > > > Thanks, > -- > kou > > In > "Re: [VOTE] Release Apache Arrow 0.14.0 - RC0" on Mon, 1 Jul 2019 11:48:50 > -0500, > Wes McKinney wrote: > > > hi Antoine, I'm not sure the origin of the conda.sh failure, have you > > tried removing any bashrc stuff related to the Anaconda distribution > > that you develop against? > > > > With the following patch I'm able to run the binary verification > > > > https://github.com/apache/arrow/pull/4768 > > > > but it failed with > > > > https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a > > > > Indeed a sig is missing from bintray. 
I was able to get the parallel > > build to run on my machine (but it failed when I piped stdin/stdout to > > a file) but I also found a bad sig > > > > https://gist.github.com/wesm/2404d55e087cc3982d93e53c83df95d5 > > > > I'm going to work on verifying more components. C# is failing with > > > > https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f > > > > but I don't think that should block the release (it would be nice if > > it passed though) > > > > I'm going to work on the Windows verification script and see if I can > > add Flight support to it > > > > All in all appears that an RC1 may be warranted unless the signature > > issues can be remedied in RC0. Seems like we might need to find an > > artifact staging solution that is not Bintray if API rate limits are > > going to be a problem. > > > > - Wes > > > > On Mon, Jul 1, 2019 at 3:48 AM Antoine Pitrou wrote: > >> > >> > >> On Ubuntu 18.04: > >> > >> - failed to verify binaries > >> > >> """ > >> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU > >> for details.' > >> Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for > >> details. > >> """ > >> > >> There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of > >> zombie curl processes running... > >> > >> - failed to verify sources > >> > >> """ > >> + export PATH > >> /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh: > >> line 55: PS1: unbound variable > >> + ask_conda= > >> + return 1 > >> + cleanup > >> + '[' no = yes ']' > >> + echo
Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"
Hi Wes, Thanks for your response. In regards to the protocol negotiation your description of feature reporting (snipped below) is along the lines of what I was thinking. It might not be necessary for 1.0.0, but at some point might become useful. > Note that we don't really have a mechanism for clients > and servers to report to each other what features they support, so > this could help with that when for applications where it might matter. Thanks, Micah On Mon, Jul 1, 2019 at 12:54 PM Wes McKinney wrote: > hi Micah, > > Sorry for the delay in feedback. I looked at the document and it seems > like a reasonable perspective about forward- and > backward-compatibility. > > It seems like the main thing you are proposing is to apply Semantic > Versioning to Format and Library versions separately. That's an > interesting idea, my thought had been to have a version number that is > FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is > more flexible in some ways, so let me clarify for others reading > > In what you are proposing, the next release would be: > > Format version: 1.0.0 > Library version: 1.0.0 > > Suppose that 20 major versions down the road we stand at > > Format version: 1.5.0 > Library version: 20.0.0 > > The minor version of the Format would indicate that there are > additions, like new elements in the Type union, but otherwise backward > and forward compatible. So the Minor version means "new things, but > old clients will not be disrupted if those new things are not used". > We've already been doing this since the V4 Format iteration but we > have not had a way to signal that there may be new features. As a > corollary to this, I wonder if we should create a dual version in the > metadata > > PROTOCOL VERSION: (what is currently MetadataVersion, V2) > FEATURE VERSION: not tracked at all > > So Minor version bumps in the format would trigger a bump in the > FeatureVersion. Note that we don't really have a mechanism for clients > and servers to report to each other what features they support, so > this could help with that when for applications where it might matter. > > Should backward/forward compatibility be disrupted in the future, then > a change to the major version would be required. So in year 2025, say, > we might decide that we want to do: > > Format version: 2.0.0 > Library version: 21.0.0 > > The Format version would live in the project's Documentation, so the > Apache releases are only the library version. > > Regarding your open questions: > > 1. Should we clean up "warts" on the specification, like redundant > information > > I don't think it's necessary. So if Metadata V5 is Format Version > 1.0.0 (currently we are V4, but we're discussing some possible > non-forward compatible changes...) I think that's OK. None of these > things are "hurting" anything > > 2. Do we need additional mechanisms for marking some features as > experimental? > > Not sure, but I think this can be mostly addressed through > documentation. Flight will still be experimental in 1.0.0, for > example. > > 3. Do we need protocol negotiation mechanisms in Flight > > Could you explain what you mean? Are you thinking if there is some > major revamp of the protocol and you need to switch between a "V1 > Flight Protocol" and a "V2 Flight Protocol"? 
> > - Wes > > On Thu, Jun 13, 2019 at 2:17 AM Micah Kornfield > wrote: > > > > Hi Everyone, > > I think there might be some ideas that we still need to reach consensus > on > > for how the format and libraries evolve in a post-1.0.0 release world. > > Specifically, I think we need to agree on definitions for > > backwards/forwards compatibility and its implications for versioning the > > format. > > > > To this end I put some thoughts down in a Google Doc [1] for the purposes > > of discussion. Comments welcome. I will start threads for any comments > in > > the document that seem to warrant further discussion, and once we reach > > consensus I can create a patch to document what we decide on as part of > the > > specification. > > > > Thanks, > > Micah > > > > [1] > > > https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit# >
Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses
Thanks for the references. If we decided to make a change around this, we could call the first 4 bytes a stream continuation marker to make it slightly less ugly * 0x: continue * 0x: stop On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield wrote: > > Hi Wes, > I'm not an expert on this either, my inclination mostly comes from some > research I've done. I think it is important to distinguish two cases: > 1. unaligned access at the processor instruction level > 2. undefined behavior > > From my reading unaligned access is fine on most modern architectures and it > seems the performance penalty has mostly been eliminated. > > Undefined behavior is a compiler/language concept. The problem is the > compiler can choose to do anything in UB scenarios, not just the "obvious" > translation. Specifically, the compiler is under no obligation to generate > the unaligned access instructions, and if it doesn't SEGVs ensue. Two > examples, both of which relate to SIMD optimizations are linked below. > > I tend to be on the conservative side with this type of thing but if we have > experts on the the ML that can offer a more informed opinion, I would love to > hear it. > > [1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html > [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709 > > On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney wrote: >> >> The <0x> solution is downright ugly but I think >> it's one of the only ways that achieves >> >> * backward compatibility (new clients can read old data) >> * opt-in forward compatibility (if we want to go to the labor of doing >> so, sort of dangerous) >> * old clients receiving new data do not blow up (they will see a >> metadata length of -1) >> >> NB 0x would look like: >> >> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32) >> Out[13]: array([4294967295,128], dtype=uint32) >> >> In [14]: np.array([(2 << 32) - 1, 128], >> dtype=np.uint32).view(np.int32) >> Out[14]: array([ -1, 128], dtype=int32) >> >> In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8) >> Out[15]: array([255, 255, 255, 255, 128, 0, 0, 0], dtype=uint8) >> >> Flatbuffers are 32-bit limited so we don't need all 64 bits. >> >> Do you know in what circumstances unaligned reads from Flatbuffers >> might cause an issue? I do not know enough about UB but my >> understanding is that it causes issues on some specialized platforms >> where for most modern x86-64 processors and compilers it is not really >> an issue (though perhaps a performance issue) >> >> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield >> wrote: >> > >> > At least on the read-side we can make this detectable by using something >> > like <0x> instead of int64_t. On the write side we >> > would need some sort of default mode that we could flip on/off if we >> > wanted to maintain compatibility. >> > >> > I should say I think we should fix it. Undefined behavior is unpaid debt >> > that might never be collected or might cause things to fail in difficult >> > to diagnose ways. And pre-1.0.0 is definitely the time. >> > >> > -Micah >> > >> > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney wrote: >> >> >> >> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney wrote: >> >> > >> >> > hi Micah, >> >> > >> >> > This is definitely unfortunate, I wish we had realized the potential >> >> > implications of having the Flatbuffer message start on a 4-byte >> >> > (rather than 8-byte) boundary. 
The cost of making such a change now >> >> > would be pretty high since all readers and writers in all languages >> >> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version >> >> > bump is the last opportunity we have to make a change like this, so we >> >> > might as well discuss it now. Note that particular implementations >> >> > could implement compatibility functions to handle the 4 to 8 byte >> >> > change so that old clients can still be understood. We'd probably want >> >> > to do this in C++, for example, since users would pretty quickly >> >> > acquire a new pyarrow version in Spark applications while they are >> >> > stuck on an old version of the Java libraries. >> >> >> >> NB such a backwards compatibility fix would not be forward-compatible, >> >> so the PySpark users would need to use a pinned version of pyarrow >> >> until Spark upgraded to Arrow 1.0.0. Maybe that's OK >> >> >> >> > >> >> > - Wes >> >> > >> >> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield >> >> > wrote: >> >> > > >> >> > > While working on trying to fix undefined behavior for unaligned memory >> >> > > accesses [1], I ran into an issue with the IPC specification [2] which >> >> > > prevents us from ever achieving zero-copy memory mapping and having >> >> > > aligned >> >> > > accesses (i.e. clean UBSan runs). >> >> > > >> >> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned >> >> > > accesses. >> >> > > >> >> > > In the IPC format we align ea
Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses
Hi Wes, I'm not an expert on this either, my inclination mostly comes from some research I've done. I think it is important to distinguish two cases: 1. unaligned access at the processor instruction level 2. undefined behavior >From my reading unaligned access is fine on most modern architectures and it seems the performance penalty has mostly been eliminated. Undefined behavior is a compiler/language concept. The problem is the compiler can choose to do anything in UB scenarios, not just the "obvious" translation. Specifically, the compiler is under no obligation to generate the unaligned access instructions, and if it doesn't SEGVs ensue. Two examples, both of which relate to SIMD optimizations are linked below. I tend to be on the conservative side with this type of thing but if we have experts on the the ML that can offer a more informed opinion, I would love to hear it. [1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709 On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney wrote: > The <0x> solution is downright ugly but I think > it's one of the only ways that achieves > > * backward compatibility (new clients can read old data) > * opt-in forward compatibility (if we want to go to the labor of doing > so, sort of dangerous) > * old clients receiving new data do not blow up (they will see a > metadata length of -1) > > NB 0x would look like: > > In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32) > Out[13]: array([4294967295,128], dtype=uint32) > > In [14]: np.array([(2 << 32) - 1, 128], > dtype=np.uint32).view(np.int32) > Out[14]: array([ -1, 128], dtype=int32) > > In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8) > Out[15]: array([255, 255, 255, 255, 128, 0, 0, 0], dtype=uint8) > > Flatbuffers are 32-bit limited so we don't need all 64 bits. > > Do you know in what circumstances unaligned reads from Flatbuffers > might cause an issue? I do not know enough about UB but my > understanding is that it causes issues on some specialized platforms > where for most modern x86-64 processors and compilers it is not really > an issue (though perhaps a performance issue) > > On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield > wrote: > > > > At least on the read-side we can make this detectable by using something > like <0x> instead of int64_t. On the write side we > would need some sort of default mode that we could flip on/off if we wanted > to maintain compatibility. > > > > I should say I think we should fix it. Undefined behavior is unpaid > debt that might never be collected or might cause things to fail in > difficult to diagnose ways. And pre-1.0.0 is definitely the time. > > > > -Micah > > > > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney > wrote: > >> > >> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney > wrote: > >> > > >> > hi Micah, > >> > > >> > This is definitely unfortunate, I wish we had realized the potential > >> > implications of having the Flatbuffer message start on a 4-byte > >> > (rather than 8-byte) boundary. The cost of making such a change now > >> > would be pretty high since all readers and writers in all languages > >> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version > >> > bump is the last opportunity we have to make a change like this, so we > >> > might as well discuss it now. Note that particular implementations > >> > could implement compatibility functions to handle the 4 to 8 byte > >> > change so that old clients can still be understood. 
We'd probably want > >> > to do this in C++, for example, since users would pretty quickly > >> > acquire a new pyarrow version in Spark applications while they are > >> > stuck on an old version of the Java libraries. > >> > >> NB such a backwards compatibility fix would not be forward-compatible, > >> so the PySpark users would need to use a pinned version of pyarrow > >> until Spark upgraded to Arrow 1.0.0. Maybe that's OK > >> > >> > > >> > - Wes > >> > > >> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield < > emkornfi...@gmail.com> wrote: > >> > > > >> > > While working on trying to fix undefined behavior for unaligned > memory > >> > > accesses [1], I ran into an issue with the IPC specification [2] > which > >> > > prevents us from ever achieving zero-copy memory mapping and having > aligned > >> > > accesses (i.e. clean UBSan runs). > >> > > > >> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned > accesses. > >> > > > >> > > In the IPC format we align each message to 8-byte boundaries. We > then > >> > > write a int32_t integer to to denote the size of flat buffer > metadata, > >> > > followed immediately by the flatbuffer metadata. This means the > >> > > flatbuffer metadata will never be 8 byte aligned. > >> > > > >> > > Do people care? A simple fix would be to use int64_t instead of > int32_t > >> > > for length. However, any fix essential
Re: [VOTE] Release Apache Arrow 0.14.0 - RC0
Hi, > but it failed with > > https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a Thanks for catching this. I failed to upload some files. I uploaded missing files. I confirmed that there are no missing files with the following Ruby script: -- #!/usr/bin/env ruby require "open-uri" require "json" require "English" ["debian", "ubuntu", "centos", "python"].each do |target| json_path = "/tmp/#{target}-file-list.json" unless File.exist?(json_path) open("https://bintray.com/api/v1/packages/apache/arrow/#{target}-rc/versions/0.14.0-rc0/files";) do |input| File.open(json_path, "w") do |json| IO.copy_stream(input, json) end end end source_paths = [] asc_paths = [] sha512_paths = [] JSON.parse(File.read(json_path)).each do |entry| path = entry["path"] case path when /\.asc\z/ asc_paths << $PREMATCH when /\.sha512\z/ sha512_paths << $PREMATCH else source_paths << path end end pp([:no_asc, source_paths - asc_paths]) pp([:no_source_for_asc, asc_paths - source_paths]) pp([:no_sha512, source_paths - sha512_paths]) pp([:no_source_for_sha512, sha512_paths - source_paths]) end -- But this is a bit strange. Download file list is read from Bintray (*). So I think that our verification script doesn't try downloading nonexistent files... (*) https://bintray.com/api/v1/packages/apache/arrow/debian-rc/versions/0.14.0-rc0/files > I'm going to work on verifying more components. C# is failing with > > https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f I couldn't reproduce this on my environment. I'll try this with clean environment. Note that we can try only C# verification with the following command line: TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1 dev/release/verify-release-candidate.sh source 0.14.0 0 > Seems like we might need to find an > artifact staging solution that is not Bintray if API rate limits are > going to be a problem. I don't get response yet from https://bintray.com/apache organization. I'll open an issue on INFRA JIRA. Thanks, -- kou In "Re: [VOTE] Release Apache Arrow 0.14.0 - RC0" on Mon, 1 Jul 2019 11:48:50 -0500, Wes McKinney wrote: > hi Antoine, I'm not sure the origin of the conda.sh failure, have you > tried removing any bashrc stuff related to the Anaconda distribution > that you develop against? > > With the following patch I'm able to run the binary verification > > https://github.com/apache/arrow/pull/4768 > > but it failed with > > https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a > > Indeed a sig is missing from bintray. I was able to get the parallel > build to run on my machine (but it failed when I piped stdin/stdout to > a file) but I also found a bad sig > > https://gist.github.com/wesm/2404d55e087cc3982d93e53c83df95d5 > > I'm going to work on verifying more components. C# is failing with > > https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f > > but I don't think that should block the release (it would be nice if > it passed though) > > I'm going to work on the Windows verification script and see if I can > add Flight support to it > > All in all appears that an RC1 may be warranted unless the signature > issues can be remedied in RC0. Seems like we might need to find an > artifact staging solution that is not Bintray if API rate limits are > going to be a problem. > > - Wes > > On Mon, Jul 1, 2019 at 3:48 AM Antoine Pitrou wrote: >> >> >> On Ubuntu 18.04: >> >> - failed to verify binaries >> >> """ >> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU >> for details.' >> Failed to verify release candidate. 
See /tmp/arrow-0.14.0.gucvU for details. >> """ >> >> There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of >> zombie curl processes running... >> >> - failed to verify sources >> >> """ >> + export PATH >> /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh: >> line 55: PS1: unbound variable >> + ask_conda= >> + return 1 >> + cleanup >> + '[' no = yes ']' >> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X >> for details.' >> Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details. >> """ >> >> There's no details in /tmp/arrow-0.14.0.yum2X >> >> Regards >> >> Antoine. >> >> >> >> >> >> Le 01/07/2019 à 07:32, Sutou Kouhei a écrit : >> > Hi, >> > >> > I would like to propose the following release candidate (RC0) of Apache >> > Arrow version 0.14.0. This is a release consiting of 618 >> > resolved JIRA issues[1]. >> > >> > This release candidate is based on commit: >> > a591d76ad9a657110368aa422bb00f4010cb6b6e [2] >> > >> > The source release rc0 is hosted at [3]. >> > The binary artifacts are hosted at [4][5][6][7]. >> > The changelog is located at [8]. >> > >> > Please download, verify checksums and signatures, run the unit tests, >> > and vote on the release. See [9] for how to validate a release ca
Re: [VOTE] Release Apache Arrow 0.14.0 - RC0
Hi, Thanks for verifying this RC. > - failed to verify binaries > > """ > + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU > for details.' > Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details. > """ > > There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of > zombie curl processes running... It seems that one of curl downloads is failed. Parallel download may be fragile. https://github.com/apache/arrow/pull/4768 by Wes will solve this situation. I've merged this. > - failed to verify sources > > """ > + export PATH > /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh: > line 55: PS1: unbound variable https://github.com/apache/arrow/pull/4773 will solve this. I added "set -u" to detect using undefined variables caused by typo in https://github.com/apache/arrow/commit/9a788dfc976035cabb0d4ab15f0f6fa306a5428d . It works well on my environment. But I understand that it's not portable with shell script that sources external shell script. (. $MINICONDA/etc/profile.d/conda.sh) I've removed "set -u" by https://github.com/apache/arrow/commit/9145c1591aedbd141454cfc7b6aad5190c0fb30e . Thanks, -- kou In <03cca1d7-7f8f-c46c-2360-132cd300c...@python.org> "Re: [VOTE] Release Apache Arrow 0.14.0 - RC0" on Mon, 1 Jul 2019 10:48:18 +0200, Antoine Pitrou wrote: > > On Ubuntu 18.04: > > - failed to verify binaries > > """ > + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU > for details.' > Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details. > """ > > There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of > zombie curl processes running... > > - failed to verify sources > > """ > + export PATH > /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh: > line 55: PS1: unbound variable > + ask_conda= > + return 1 > + cleanup > + '[' no = yes ']' > + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X > for details.' > Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details. > """ > > There's no details in /tmp/arrow-0.14.0.yum2X > > Regards > > Antoine. > > > > > > Le 01/07/2019 à 07:32, Sutou Kouhei a écrit : >> Hi, >> >> I would like to propose the following release candidate (RC0) of Apache >> Arrow version 0.14.0. This is a release consiting of 618 >> resolved JIRA issues[1]. >> >> This release candidate is based on commit: >> a591d76ad9a657110368aa422bb00f4010cb6b6e [2] >> >> The source release rc0 is hosted at [3]. >> The binary artifacts are hosted at [4][5][6][7]. >> The changelog is located at [8]. >> >> Please download, verify checksums and signatures, run the unit tests, >> and vote on the release. See [9] for how to validate a release candidate. >> >> NOTE: You must use verify-release-candidate.sh at master. >> I've fixed some problems after apache-arrow-0.14.0 tag. >> C#'s "sourcelink test" is fragile. (Network related problem?) >> It may be better that we add retry logic to "sourcelink test". >> >> The vote will be open for at least 72 hours. >> >> [ ] +1 Release this as Apache Arrow 0.14.0 >> [ ] +0 >> [ ] -1 Do not release this as Apache Arrow 0.14.0 because... 
>> >> [1]: >> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.0 >> [2]: >> https://github.com/apache/arrow/tree/a591d76ad9a657110368aa422bb00f4010cb6b6e >> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.0-rc0 >> [4]: https://bintray.com/apache/arrow/centos-rc/0.14.0-rc0 >> [5]: https://bintray.com/apache/arrow/debian-rc/0.14.0-rc0 >> [6]: https://bintray.com/apache/arrow/python-rc/0.14.0-rc0 >> [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.0-rc0 >> [8]: >> https://github.com/apache/arrow/blob/a591d76ad9a657110368aa422bb00f4010cb6b6e/CHANGELOG.md >> [9]: >> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates >> >> >> Thanks, >> -- >> kou >>
Re: Spark and Arrow Flight
On Mon, Jul 1, 2019 at 3:50 PM David Li wrote: > > I think I'd prefer #3 over overloading an existing call (#2). > > We've been thinking about a similar issue, where sometimes we want > just the schema, but the service can't necessarily return the schema > without fetching data - right now we return a sentinel value in > GetFlightInfo, but a separate RPC would let us explicitly indicate an > error. > > I might be missing something though - what happens between step 1 and > 2 that makes the endpoints available? Would it make sense to use > DoAction to cause the backend to "prepare" the endpoints, and have the > result of that be an encoded schema? So then the flow would be > DoAction -> GetFlightInfo -> DoGet. I think it depends on the particular server/planner implementation. If preparing a dataset is expensive (imagine loading a large dataset into a distributed cache, then dropping it later), then it might be that you have: DoAction: Load/Prepare $DATASET ... clients access the dataset using GetFlightInfo with path $DATASET DoAction: Drop $DATASET In other cases GetFlightInfo might contain a SQL query and so having a separate DoAction workflow is not needed > > Best, > David > > On 7/1/19, Wes McKinney wrote: > > My inclination is either #2 or #3. #4 is an option of course, but I > > like the more structured solution of explicitly requesting the schema > > given a descriptor. > > > > In both cases, it's possible that schemas are sent twice, e.g. if you > > call GetSchema and then later call GetFlightInfo and so you receive > > the schema again. The schema is optional, so if it became a > > performance problem then a particular server might return the schema > > as null from GetFlightInfo. > > > > I think it's valid to want to make a single GetFlightInfo RPC request > > that returns _both_ the schema and the query plan. > > > > Thoughts from others? > > > > On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau wrote: > >> > >> My initial inclination is towards #3 but I'd be curious what others > >> think. > >> In the case of #3, I wonder if it makes sense to then pull the Schema off > >> the GetFlightInfo response... > >> > >> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray wrote: > >> > >> > Hi All, > >> > > >> > I have been working on building an arrow flight source for spark. The > >> > goal > >> > here is for Spark to be able to use a group of arrow flight endpoints > >> > to > >> > get a dataset pulled over to spark in parallel. > >> > > >> > I am unsure of the best model for the spark <-> flight conversation and > >> > wanted to get your opinion on the best way to go. > >> > > >> > I am breaking up the query to flight from spark into 3 parts: > >> > 1) get the schema using GetFlightInfo. This is needed to do further > >> > lazy > >> > operations in Spark > >> > 2) get the endpoints by calling GetFlightInfo a 2nd time with a > >> > different > >> > argument. This returns the list endpoints on the parallel flight > >> > server. > >> > The endpoints are not available till data is ready to be fetched, which > >> > is > >> > done after the schema but is needed before DoGet is called. > >> > 3) call get stream on all endpoints from 2 > >> > > >> > I think I have to do each step however I don't like having to call > >> > getInfo > >> > twice, it doesn't seem very elegant. 
I see a few options: > >> > 1) live with calling GetFlightInfo twice and with a custom bytes cmd to > >> > differentiate the purpose of each call > >> > 2) add an argument to GetFlightInfo to tell it its being called only > >> > for > >> > the schema > >> > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to return > >> > just > >> > the Schema in question > >> > 4) use DoAction and wrap the expected FlightInfo in a Result > >> > > >> > I am aware that 4 is probably the least disruptive but I'm also not a > >> > fan > >> > as (to me) it implies performing an action on the server side. > >> > Suggestions > >> > 2 & 3 are larger changes and I am reluctant to do that unless there is > >> > a > >> > consensus here. None of them are great options and I am wondering what > >> > everyone thinks the best approach might be? Particularly as I think this > >> > is > >> > likely to come up in more applications than just spark. > >> > > >> > Best, > >> > Ryan > >> > > >
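[Editor's note] For concreteness, a minimal client-side sketch of the three-step flow Ryan describes, written against the present-day pyarrow.flight API (flight.connect, get_flight_info, do_get, with a get_schema call standing in for the proposed option #3). The method names, hostname, and command string are assumptions for illustration, not the API that existed at the time of this thread.

```python
import pyarrow.flight as flight

# Hypothetical planner address and query; adjust to the actual deployment.
client = flight.connect("grpc://planner.example.com:8815")
descriptor = flight.FlightDescriptor.for_command(b"SELECT * FROM lineitem")

# Step 1: fetch only the schema so Spark can build its lazy plan
# (this is what the proposed GetSchema RPC -- option #3 -- would provide).
schema = client.get_schema(descriptor).schema

# Step 2: ask the planner for the parallel endpoints once the data is ready.
info = client.get_flight_info(descriptor)

# Step 3: DoGet against each endpoint, typically one Spark task per endpoint.
tables = []
for endpoint in info.endpoints:
    for location in endpoint.locations:
        data_client = flight.connect(location)
        tables.append(data_client.do_get(endpoint.ticket).read_all())
```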
Tracking running threads to close prior to Arrow 1.0.0 release
I started a Google Document to try to assemble outstanding discussion threads with links to the mailing list so we do not lose track of the various items that are up in the air. The document is not complete -- if you would like Edit access to the document please request and I will add you. Feel free to comment also:

https://docs.google.com/document/d/10QrrJRdgqk5D9RQrkxqwvj3hiuruy8A2jY0teQvXz3s/edit?usp=sharing

Thanks,
Wes
Re: Spark and Arrow Flight
I think I'd prefer #3 over overloading an existing call (#2). We've been thinking about a similar issue, where sometimes we want just the schema, but the service can't necessarily return the schema without fetching data - right now we return a sentinel value in GetFlightInfo, but a separate RPC would let us explicitly indicate an error. I might be missing something though - what happens between step 1 and 2 that makes the endpoints available? Would it make sense to use DoAction to cause the backend to "prepare" the endpoints, and have the result of that be an encoded schema? So then the flow would be DoAction -> GetFlightInfo -> DoGet. Best, David On 7/1/19, Wes McKinney wrote: > My inclination is either #2 or #3. #4 is an option of course, but I > like the more structured solution of explicitly requesting the schema > given a descriptor. > > In both cases, it's possible that schemas are sent twice, e.g. if you > call GetSchema and then later call GetFlightInfo and so you receive > the schema again. The schema is optional, so if it became a > performance problem then a particular server might return the schema > as null from GetFlightInfo. > > I think it's valid to want to make a single GetFlightInfo RPC request > that returns _both_ the schema and the query plan. > > Thoughts from others? > > On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau wrote: >> >> My initial inclination is towards #3 but I'd be curious what others >> think. >> In the case of #3, I wonder if it makes sense to then pull the Schema off >> the GetFlightInfo response... >> >> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray wrote: >> >> > Hi All, >> > >> > I have been working on building an arrow flight source for spark. The >> > goal >> > here is for Spark to be able to use a group of arrow flight endpoints >> > to >> > get a dataset pulled over to spark in parallel. >> > >> > I am unsure of the best model for the spark <-> flight conversation and >> > wanted to get your opinion on the best way to go. >> > >> > I am breaking up the query to flight from spark into 3 parts: >> > 1) get the schema using GetFlightInfo. This is needed to do further >> > lazy >> > operations in Spark >> > 2) get the endpoints by calling GetFlightInfo a 2nd time with a >> > different >> > argument. This returns the list endpoints on the parallel flight >> > server. >> > The endpoints are not available till data is ready to be fetched, which >> > is >> > done after the schema but is needed before DoGet is called. >> > 3) call get stream on all endpoints from 2 >> > >> > I think I have to do each step however I don't like having to call >> > getInfo >> > twice, it doesn't seem very elegant. I see a few options: >> > 1) live with calling GetFlightInfo twice and with a custom bytes cmd to >> > differentiate the purpose of each call >> > 2) add an argument to GetFlightInfo to tell it its being called only >> > for >> > the schema >> > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to return >> > just >> > the Schema in question >> > 4) use DoAction and wrap the expected FlightInfo in a Result >> > >> > I am aware that 4 is probably the least disruptive but I'm also not a >> > fan >> > as (to me) it implies performing an action on the server side. >> > Suggestions >> > 2 & 3 are larger changes and I am reluctant to do that unless there is >> > a >> > consensus here. None of them are great options and I am wondering what >> > everyone thinks the best approach might be? 
Particularly as I think this >> > is >> > likely to come up in more applications than just spark. >> > >> > Best, >> > Ryan >> > >
Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses
The <0x> solution is downright ugly but I think it's one of the only ways that achieves * backward compatibility (new clients can read old data) * opt-in forward compatibility (if we want to go to the labor of doing so, sort of dangerous) * old clients receiving new data do not blow up (they will see a metadata length of -1) NB 0x would look like: In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32) Out[13]: array([4294967295,128], dtype=uint32) In [14]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.int32) Out[14]: array([ -1, 128], dtype=int32) In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8) Out[15]: array([255, 255, 255, 255, 128, 0, 0, 0], dtype=uint8) Flatbuffers are 32-bit limited so we don't need all 64 bits. Do you know in what circumstances unaligned reads from Flatbuffers might cause an issue? I do not know enough about UB but my understanding is that it causes issues on some specialized platforms where for most modern x86-64 processors and compilers it is not really an issue (though perhaps a performance issue) On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield wrote: > > At least on the read-side we can make this detectable by using something like > <0x> instead of int64_t. On the write side we would > need some sort of default mode that we could flip on/off if we wanted to > maintain compatibility. > > I should say I think we should fix it. Undefined behavior is unpaid debt > that might never be collected or might cause things to fail in difficult to > diagnose ways. And pre-1.0.0 is definitely the time. > > -Micah > > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney wrote: >> >> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney wrote: >> > >> > hi Micah, >> > >> > This is definitely unfortunate, I wish we had realized the potential >> > implications of having the Flatbuffer message start on a 4-byte >> > (rather than 8-byte) boundary. The cost of making such a change now >> > would be pretty high since all readers and writers in all languages >> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version >> > bump is the last opportunity we have to make a change like this, so we >> > might as well discuss it now. Note that particular implementations >> > could implement compatibility functions to handle the 4 to 8 byte >> > change so that old clients can still be understood. We'd probably want >> > to do this in C++, for example, since users would pretty quickly >> > acquire a new pyarrow version in Spark applications while they are >> > stuck on an old version of the Java libraries. >> >> NB such a backwards compatibility fix would not be forward-compatible, >> so the PySpark users would need to use a pinned version of pyarrow >> until Spark upgraded to Arrow 1.0.0. Maybe that's OK >> >> > >> > - Wes >> > >> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield >> > wrote: >> > > >> > > While working on trying to fix undefined behavior for unaligned memory >> > > accesses [1], I ran into an issue with the IPC specification [2] which >> > > prevents us from ever achieving zero-copy memory mapping and having >> > > aligned >> > > accesses (i.e. clean UBSan runs). >> > > >> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses. >> > > >> > > In the IPC format we align each message to 8-byte boundaries. We then >> > > write a int32_t integer to to denote the size of flat buffer metadata, >> > > followed immediately by the flatbuffer metadata. This means the >> > > flatbuffer metadata will never be 8 byte aligned. 
>> > > >> > > Do people care? A simple fix would be to use int64_t instead of int32_t >> > > for length. However, any fix essentially breaks all previous client >> > > library versions or incurs a memory copy. >> > > >> > > [1] https://github.com/apache/arrow/pull/4757 >> > > [2] https://arrow.apache.org/docs/ipc.html
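[Editor's note] A toy sketch of the framing trade-off being discussed. It assumes 0xFFFFFFFF as the 4-byte continuation marker (the value the elided "<0x...>" placeholders above appear to refer to, and consistent with old clients reading a metadata length of -1 as in Wes's numpy example) and assumes 8-byte padding of the metadata; this is illustrative Python, not Arrow library code.

```python
import struct

def current_frame(metadata: bytes) -> bytes:
    # Current IPC framing: int32 metadata size, then the flatbuffer.
    # With the message starting on an 8-byte boundary, the flatbuffer
    # itself begins at offset 4, i.e. it is only 4-byte aligned.
    return struct.pack("<i", len(metadata)) + metadata

def proposed_frame(metadata: bytes) -> bytes:
    # Proposed framing: 4-byte continuation marker followed by the int32
    # metadata size, so the flatbuffer begins at offset 8 and stays
    # 8-byte aligned. Old readers see the marker as a length of -1.
    padded = metadata + b"\x00" * (-len(metadata) % 8)
    return struct.pack("<I", 0xFFFFFFFF) + struct.pack("<i", len(padded)) + padded

fb = b"\x01" * 12  # stand-in for flatbuffer metadata bytes
assert current_frame(fb).index(fb) % 8 == 4   # misaligned flatbuffer start
assert proposed_frame(fb).index(fb) % 8 == 0  # 8-byte aligned flatbuffer start
```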
Re: Spark and Arrow Flight
My inclination is either #2 or #3. #4 is an option of course, but I like the more structured solution of explicitly requesting the schema given a descriptor. In both cases, it's possible that schemas are sent twice, e.g. if you call GetSchema and then later call GetFlightInfo and so you receive the schema again. The schema is optional, so if it became a performance problem then a particular server might return the schema as null from GetFlightInfo. I think it's valid to want to make a single GetFlightInfo RPC request that returns _both_ the schema and the query plan. Thoughts from others? On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau wrote: > > My initial inclination is towards #3 but I'd be curious what others think. > In the case of #3, I wonder if it makes sense to then pull the Schema off > the GetFlightInfo response... > > On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray wrote: > > > Hi All, > > > > I have been working on building an arrow flight source for spark. The goal > > here is for Spark to be able to use a group of arrow flight endpoints to > > get a dataset pulled over to spark in parallel. > > > > I am unsure of the best model for the spark <-> flight conversation and > > wanted to get your opinion on the best way to go. > > > > I am breaking up the query to flight from spark into 3 parts: > > 1) get the schema using GetFlightInfo. This is needed to do further lazy > > operations in Spark > > 2) get the endpoints by calling GetFlightInfo a 2nd time with a different > > argument. This returns the list endpoints on the parallel flight server. > > The endpoints are not available till data is ready to be fetched, which is > > done after the schema but is needed before DoGet is called. > > 3) call get stream on all endpoints from 2 > > > > I think I have to do each step however I don't like having to call getInfo > > twice, it doesn't seem very elegant. I see a few options: > > 1) live with calling GetFlightInfo twice and with a custom bytes cmd to > > differentiate the purpose of each call > > 2) add an argument to GetFlightInfo to tell it its being called only for > > the schema > > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to return just > > the Schema in question > > 4) use DoAction and wrap the expected FlightInfo in a Result > > > > I am aware that 4 is probably the least disruptive but I'm also not a fan > > as (to me) it implies performing an action on the server side. Suggestions > > 2 & 3 are larger changes and I am reluctant to do that unless there is a > > consensus here. None of them are great options and I am wondering what > > everyone thinks the best approach might be? Particularly as I think this is > > likely to come up in more applications than just spark. > > > > Best, > > Ryan > >
[jira] [Created] (ARROW-5820) [Release] Remove undefined variable check from verify script
Sutou Kouhei created ARROW-5820:
-----------------------------------

Summary: [Release] Remove undefined variable check from verify script
Key: ARROW-5820
URL: https://issues.apache.org/jira/browse/ARROW-5820
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
Fix For: 0.14.0

External shell scripts may refer to an unbound variable:

{noformat}
/tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh: line 55: PS1: unbound variable
{noformat}

https://lists.apache.org/thread.html/ebe8551eed2353b248b19084810ff454942b55470b9cf5837aa6cf79@%3Cdev.arrow.apache.org%3E

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"
hi Micah, Sorry for the delay in feedback. I looked at the document and it seems like a reasonable perspective about forward- and backward-compatibility. It seems like the main thing you are proposing is to apply Semantic Versioning to Format and Library versions separately. That's an interesting idea, my thought had been to have a version number that is FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is more flexible in some ways, so let me clarify for others reading In what you are proposing, the next release would be: Format version: 1.0.0 Library version: 1.0.0 Suppose that 20 major versions down the road we stand at Format version: 1.5.0 Library version: 20.0.0 The minor version of the Format would indicate that there are additions, like new elements in the Type union, but otherwise backward and forward compatible. So the Minor version means "new things, but old clients will not be disrupted if those new things are not used". We've already been doing this since the V4 Format iteration but we have not had a way to signal that there may be new features. As a corollary to this, I wonder if we should create a dual version in the metadata PROTOCOL VERSION: (what is currently MetadataVersion, V2) FEATURE VERSION: not tracked at all So Minor version bumps in the format would trigger a bump in the FeatureVersion. Note that we don't really have a mechanism for clients and servers to report to each other what features they support, so this could help with that when for applications where it might matter. Should backward/forward compatibility be disrupted in the future, then a change to the major version would be required. So in year 2025, say, we might decide that we want to do: Format version: 2.0.0 Library version: 21.0.0 The Format version would live in the project's Documentation, so the Apache releases are only the library version. Regarding your open questions: 1. Should we clean up "warts" on the specification, like redundant information I don't think it's necessary. So if Metadata V5 is Format Version 1.0.0 (currently we are V4, but we're discussing some possible non-forward compatible changes...) I think that's OK. None of these things are "hurting" anything 2. Do we need additional mechanisms for marking some features as experimental? Not sure, but I think this can be mostly addressed through documentation. Flight will still be experimental in 1.0.0, for example. 3. Do we need protocol negotiation mechanisms in Flight Could you explain what you mean? Are you thinking if there is some major revamp of the protocol and you need to switch between a "V1 Flight Protocol" and a "V2 Flight Protocol"? - Wes On Thu, Jun 13, 2019 at 2:17 AM Micah Kornfield wrote: > > Hi Everyone, > I think there might be some ideas that we still need to reach consensus on > for how the format and libraries evolve in a post-1.0.0 release world. > Specifically, I think we need to agree on definitions for > backwards/forwards compatibility and its implications for versioning the > format. > > To this end I put some thoughts down in a Google Doc [1] for the purposes > of discussion. Comments welcome. I will start threads for any comments in > the document that seem to warrant further discussion, and once we reach > consensus I can create a patch to document what we decide on as part of the > specification. > > Thanks, > Micah > > [1] > https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#
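[Editor's note] To make the "additive minor / breaking major" rule above concrete, here is a hypothetical sketch of how a reader might apply the protocol-version/feature-version split Wes describes. The field names and feature names are invented for illustration; nothing like this exists in the Arrow metadata today.

```python
from dataclasses import dataclass

@dataclass
class FormatVersion:
    major: int  # "protocol" version: only bumped for incompatible changes
    minor: int  # "feature" version: additive, harmless if new features go unused

SUPPORTED = FormatVersion(major=1, minor=2)  # what this hypothetical reader implements

def can_read(data_version: FormatVersion,
             features_used: set,
             known_features: set) -> bool:
    # A different major version may be backward/forward incompatible: reject.
    if data_version.major != SUPPORTED.major:
        return False
    # A newer minor version only matters if the writer actually used
    # features this reader does not know about.
    return features_used <= known_features

# Data written by a newer library, but using only old features: still readable.
assert can_read(FormatVersion(1, 5), {"dictionary"}, {"dictionary", "union"})
# Data that actually uses an unknown feature: not readable.
assert not can_read(FormatVersion(1, 5), {"large_list"}, {"dictionary", "union"})
```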
Re: RecordBatch with Tensors/Arrays
hi Andrew,

I'm copying dev@ just so more folks are in the loop

On Wed, Jun 19, 2019 at 9:13 AM Andrew Spott wrote:
>
> I was told to post this here, rather than as an issue on Github.
>
> I'm looking to serialize data that looks something like this:
>
> ```
> record = { "predicted": ,
>            "truth": ,
>            "loss": ,
>            "index": }
>
> data = [
>     pa.array([record, record, record]),
>     pa.array([, , ])
>     pa.array([, , ])
> ]
>
> batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
> ```
>
> But I'm not sure how to do that, or even if what I'm trying to do is the right way to do it.

We don't support tensors/ndarrays as first-class value types in the Python or C++ libraries. This could be done hypothetically using the new ExtensionType facility. Tensor values would be embedded in an Arrow Binary column. There is already ARROW-1614 open for this. I also opened ARROW-5819 about implementing the Python-side plumbing around this.

Another possible option is to infer list<...> types from ndarrays (e.g. list<list<float64>> from an ndarray of ndim=2 and dtype=float64), but this has not been implemented.

> What is the difference between `pa.array` and `pa.list_`? This formulation is an array of structs, but is the struct of arrays formulation of this possible? i.e.:

* The return value of pa.array is an Array object, which wraps the C++ arrow::Array type, the base class for value sequences. It's data, not metadata
* pa.list_ returns an instance of ListType, which is a DataType subclass. It's metadata, not data

> ```
> data = [
>     pa.array([ , , ]),
>     pa.array([ , , ]),
>     pa.array([, , ]),
>     ...
> ]
> ```
>
> Which doesn't currently work. It seems that there is a separation between '1d arraylike' datatypes and 'pythonlike' datatypes (and 'nd arraylike' datatypes), so I can't have a struct of an array.

Right. ndarrays as array cell values are not natively part of the Arrow columnar format. But they could be supported through extensions. This would be a nice project for someone to take on in the future

- Wes

> -Andrew
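[Editor's note] In the meantime, a workaround along the lines Wes sketches (tensor values embedded in a Binary column) can be done with plain pyarrow. The column names and shapes below are taken from Andrew's example; the round-trip relies on tracking shape/dtype out of band. This is ordinary pyarrow usage, not the ExtensionType machinery tracked in ARROW-1614/ARROW-5819.

```python
import numpy as np
import pyarrow as pa

predicted = [np.random.rand(3, 4) for _ in range(3)]  # example float64 tensors
loss = [0.1, 0.2, 0.3]
index = [0, 1, 2]

# Store each ndarray's raw bytes in a binary column, next to scalar columns.
batch = pa.RecordBatch.from_arrays(
    [
        pa.array([t.tobytes() for t in predicted], type=pa.binary()),
        pa.array(loss, type=pa.float64()),
        pa.array(index, type=pa.int64()),
    ],
    ["predicted", "loss", "index"],
)

# Reading a tensor back requires re-applying the known shape and dtype.
roundtrip = np.frombuffer(batch.column(0)[0].as_py(), dtype=np.float64).reshape(3, 4)
assert np.array_equal(roundtrip, predicted[0])
```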
[jira] [Created] (ARROW-5819) [Python] Store sequences of arbitrary ndarrays (with same type) in Tensor value type
Wes McKinney created ARROW-5819:
-----------------------------------

Summary: [Python] Store sequences of arbitrary ndarrays (with same type) in Tensor value type
Key: ARROW-5819
URL: https://issues.apache.org/jira/browse/ARROW-5819
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Reporter: Wes McKinney

This can be implemented using extension types, based on outcome of ARROW-1614

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (ARROW-5818) [Java][Gandiva] support varlen output vectors
Pindikura Ravindra created ARROW-5818:
-----------------------------------

Summary: [Java][Gandiva] support varlen output vectors
Key: ARROW-5818
URL: https://issues.apache.org/jira/browse/ARROW-5818
Project: Apache Arrow
Issue Type: Task
Components: Java
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Re: [VOTE] Release Apache Arrow 0.14.0 - RC0
The C++/Python source build looks fine to me on the Windows side -- I added Flight support in https://github.com/apache/arrow/pull/4770 I opened https://issues.apache.org/jira/browse/ARROW-5817 as there is a risk that Flight Python tests might be silently skipped. We check in our Python package builds that pyarrow.flight can be imported successfully so I don't think those packages are at risk of having a problem On Mon, Jul 1, 2019 at 11:48 AM Wes McKinney wrote: > > hi Antoine, I'm not sure the origin of the conda.sh failure, have you > tried removing any bashrc stuff related to the Anaconda distribution > that you develop against? > > With the following patch I'm able to run the binary verification > > https://github.com/apache/arrow/pull/4768 > > but it failed with > > https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a > > Indeed a sig is missing from bintray. I was able to get the parallel > build to run on my machine (but it failed when I piped stdin/stdout to > a file) but I also found a bad sig > > https://gist.github.com/wesm/2404d55e087cc3982d93e53c83df95d5 > > I'm going to work on verifying more components. C# is failing with > > https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f > > but I don't think that should block the release (it would be nice if > it passed though) > > I'm going to work on the Windows verification script and see if I can > add Flight support to it > > All in all appears that an RC1 may be warranted unless the signature > issues can be remedied in RC0. Seems like we might need to find an > artifact staging solution that is not Bintray if API rate limits are > going to be a problem. > > - Wes > > On Mon, Jul 1, 2019 at 3:48 AM Antoine Pitrou wrote: > > > > > > On Ubuntu 18.04: > > > > - failed to verify binaries > > > > """ > > + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU > > for details.' > > Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details. > > """ > > > > There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of > > zombie curl processes running... > > > > - failed to verify sources > > > > """ > > + export PATH > > /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh: > > line 55: PS1: unbound variable > > + ask_conda= > > + return 1 > > + cleanup > > + '[' no = yes ']' > > + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X > > for details.' > > Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details. > > """ > > > > There's no details in /tmp/arrow-0.14.0.yum2X > > > > Regards > > > > Antoine. > > > > > > > > > > > > Le 01/07/2019 à 07:32, Sutou Kouhei a écrit : > > > Hi, > > > > > > I would like to propose the following release candidate (RC0) of Apache > > > Arrow version 0.14.0. This is a release consiting of 618 > > > resolved JIRA issues[1]. > > > > > > This release candidate is based on commit: > > > a591d76ad9a657110368aa422bb00f4010cb6b6e [2] > > > > > > The source release rc0 is hosted at [3]. > > > The binary artifacts are hosted at [4][5][6][7]. > > > The changelog is located at [8]. > > > > > > Please download, verify checksums and signatures, run the unit tests, > > > and vote on the release. See [9] for how to validate a release candidate. > > > > > > NOTE: You must use verify-release-candidate.sh at master. > > > I've fixed some problems after apache-arrow-0.14.0 tag. > > > C#'s "sourcelink test" is fragile. (Network related problem?) 
> > > It may be better that we add retry logic to "sourcelink test". > > > > > > The vote will be open for at least 72 hours. > > > > > > [ ] +1 Release this as Apache Arrow 0.14.0 > > > [ ] +0 > > > [ ] -1 Do not release this as Apache Arrow 0.14.0 because... > > > > > > [1]: > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.0 > > > [2]: > > > https://github.com/apache/arrow/tree/a591d76ad9a657110368aa422bb00f4010cb6b6e > > > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.0-rc0 > > > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.0-rc0 > > > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.0-rc0 > > > [6]: https://bintray.com/apache/arrow/python-rc/0.14.0-rc0 > > > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.0-rc0 > > > [8]: > > > https://github.com/apache/arrow/blob/a591d76ad9a657110368aa422bb00f4010cb6b6e/CHANGELOG.md > > > [9]: > > > https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates > > > > > > > > > Thanks, > > > -- > > > kou > > >
[jira] [Created] (ARROW-5817) [Python] Use pytest marks for Flight test to avoid silently skipping unit tests due to import failures
Wes McKinney created ARROW-5817:
-----------------------------------

Summary: [Python] Use pytest marks for Flight test to avoid silently skipping unit tests due to import failures
Key: ARROW-5817
URL: https://issues.apache.org/jira/browse/ARROW-5817
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.14.0

The approach used to determine whether or not Flight has been built will fail silently if the extension is built but there is an ImportError caused by linking or other issues:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_flight.py#L35

We should use the same "auto" approach as other optional components (see https://github.com/apache/arrow/blob/master/python/pyarrow/tests/conftest.py#L40), with the option for forced opt-in so that an ImportError does not cause silent skipping: {{--flight}} will force the tests to run if we expect them to work.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
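[Editor's note] As a rough illustration of the "auto vs. forced opt-in" pattern the issue refers to (simplified, not the actual Arrow conftest.py), a --flight option can turn an ImportError into a hard failure instead of a silent skip:

```python
# Goes in a conftest.py; illustrative only.
import pytest

def pytest_addoption(parser):
    parser.addoption("--flight", action="store_true", default=False,
                     help="force Flight tests to run (fail on ImportError)")

@pytest.fixture
def flight_module(request):
    try:
        import pyarrow.flight as flight
    except ImportError as exc:
        if request.config.getoption("--flight"):
            # Forced opt-in: an import failure is a real test failure.
            pytest.fail(f"--flight given but pyarrow.flight failed to import: {exc}")
        # Default "auto" behavior: skip when Flight is genuinely unavailable.
        pytest.skip("pyarrow.flight not available")
    return flight

def test_connect_smoke(flight_module):
    # Placeholder test body; real tests would exercise the Flight client/server.
    assert hasattr(flight_module, "FlightClient")
```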
Re: [VOTE] Release Apache Arrow 0.14.0 - RC0
hi Antoine, I'm not sure of the origin of the conda.sh failure. Have you
tried removing any bashrc stuff related to the Anaconda distribution that
you develop against?

With the following patch I'm able to run the binary verification

https://github.com/apache/arrow/pull/4768

but it failed with

https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a

Indeed, a sig is missing from Bintray. I was able to get the parallel
build to run on my machine (but it failed when I piped stdin/stdout to a
file), and I also found a bad sig

https://gist.github.com/wesm/2404d55e087cc3982d93e53c83df95d5

I'm going to work on verifying more components. C# is failing with

https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f

but I don't think that should block the release (it would be nice if it
passed, though).

I'm going to work on the Windows verification script and see if I can add
Flight support to it.

All in all, it appears that an RC1 may be warranted unless the signature
issues can be remedied in RC0. It seems like we might need to find an
artifact staging solution other than Bintray if API rate limits are going
to be a problem.

- Wes

On Mon, Jul 1, 2019 at 3:48 AM Antoine Pitrou wrote:
>
>
> On Ubuntu 18.04:
>
> - failed to verify binaries
>
> """
> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU
> for details.'
> Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details.
> """
>
> There are no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of
> zombie curl processes running...
>
> - failed to verify sources
>
> """
> + export PATH
> /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
> line 55: PS1: unbound variable
> + ask_conda=
> + return 1
> + cleanup
> + '[' no = yes ']'
> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X
> for details.'
> Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details.
> """
>
> There are no details in /tmp/arrow-0.14.0.yum2X
>
> Regards
>
> Antoine.
>
>
> Le 01/07/2019 à 07:32, Sutou Kouhei a écrit :
> > Hi,
> >
> > I would like to propose the following release candidate (RC0) of Apache
> > Arrow version 0.14.0. This is a release consisting of 618 resolved
> > JIRA issues [1].
> >
> > This release candidate is based on commit:
> > a591d76ad9a657110368aa422bb00f4010cb6b6e [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7].
> > The changelog is located at [8].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [9] for how to validate a release candidate.
> >
> > NOTE: You must use verify-release-candidate.sh at master.
> > I've fixed some problems after the apache-arrow-0.14.0 tag.
> > C#'s "sourcelink test" is fragile (a network-related problem?).
> > It may be better to add retry logic to "sourcelink test".
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 0.14.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 0.14.0 because...
> >
> > [1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.0
> > [2]: https://github.com/apache/arrow/tree/a591d76ad9a657110368aa422bb00f4010cb6b6e
> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.0-rc0
> > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.0-rc0
> > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.0-rc0
> > [6]: https://bintray.com/apache/arrow/python-rc/0.14.0-rc0
> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.0-rc0
> > [8]: https://github.com/apache/arrow/blob/a591d76ad9a657110368aa422bb00f4010cb6b6e/CHANGELOG.md
> > [9]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >
> >
> > Thanks,
> > --
> > kou
> >
[jira] [Created] (ARROW-5816) [Release] Parallel curl does not work reliably in verify-release-candidate.sh
Wes McKinney created ARROW-5816:
-----------------------------------

             Summary: [Release] Parallel curl does not work reliably in verify-release-candidate.sh
                 Key: ARROW-5816
                 URL: https://issues.apache.org/jira/browse/ARROW-5816
             Project: Apache Arrow
          Issue Type: Bug
          Components: Developer Tools
            Reporter: Wes McKinney
            Assignee: Wes McKinney
             Fix For: 0.14.0


The script can exit early without waiting for the curl processes to finish.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
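The verification script itself is a shell script, but the fix amounts to "start the downloads in parallel, then block until every one has finished and check each result before declaring success". A minimal Python sketch of that pattern, with hypothetical URLs:

--
#!/usr/bin/env python
# Illustrative sketch only, not the real verify-release-candidate.sh.
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlretrieve

URLS = [
    # hypothetical artifact URLs
    "https://example.org/apache-arrow-0.14.0.tar.gz",
    "https://example.org/apache-arrow-0.14.0.tar.gz.asc",
    "https://example.org/apache-arrow-0.14.0.tar.gz.sha512",
]


def download(url):
    filename = url.rsplit("/", 1)[-1]
    urlretrieve(url, filename)  # raises on HTTP or connection errors
    return filename


def download_all(urls, max_workers=4):
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download, u): u for u in urls}
        # as_completed blocks until every download has finished, so the
        # caller cannot exit early while transfers are still in flight.
        for future in as_completed(futures):
            try:
                print("downloaded", future.result())
            except Exception as exc:
                failures.append((futures[future], exc))
    if failures:
        raise RuntimeError("%d download(s) failed: %r" % (len(failures), failures))


if __name__ == "__main__":
    download_all(URLS)
--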
[jira] [Created] (ARROW-5815) [Java] Support swap functionality for fixed-width vectors
Liya Fan created ARROW-5815:
-------------------------------

             Summary: [Java] Support swap functionality for fixed-width vectors
                 Key: ARROW-5815
                 URL: https://issues.apache.org/jira/browse/ARROW-5815
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Java
            Reporter: Liya Fan
            Assignee: Liya Fan


Support swapping data elements for fixed-width vectors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
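The issue targets the Arrow Java library; as a rough, language-neutral illustration of what an element swap on a fixed-width vector involves (exchanging two width-sized slots in the data buffer plus the corresponding validity bits), here is a small Python sketch. The function name and layout are illustrative only:

--
#!/usr/bin/env python
# Conceptual sketch only; not the Arrow Java API.
import struct


def swap_fixed_width(data, validity, width, i, j):
    # Swap the two fixed-size data slots.
    a, b = i * width, j * width
    data[a:a + width], data[b:b + width] = data[b:b + width], data[a:a + width]

    # Swap the validity bits (1 bit per element, LSB-first within each byte).
    bit_i = (validity[i // 8] >> (i % 8)) & 1
    bit_j = (validity[j // 8] >> (j % 8)) & 1
    if bit_i != bit_j:
        validity[i // 8] ^= 1 << (i % 8)
        validity[j // 8] ^= 1 << (j % 8)


# Example: a "vector" of three 4-byte little-endian integers.
data = bytearray(struct.pack("<3i", 10, 20, 30))
validity = bytearray([0b00000111])  # all three elements valid
swap_fixed_width(data, validity, 4, 0, 2)
print(struct.unpack("<3i", data))  # (30, 20, 10)
--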
[jira] [Created] (ARROW-5814) [Java] Implement a HashMap for DictionaryEncoder
Ji Liu created ARROW-5814:
-----------------------------

             Summary: [Java] Implement a HashMap for DictionaryEncoder
                 Key: ARROW-5814
                 URL: https://issues.apache.org/jira/browse/ARROW-5814
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Ji Liu
            Assignee: Ji Liu


As a follow-up of [ARROW-5726|https://issues.apache.org/jira/browse/ARROW-5726], implement a Map for DictionaryEncoder to reduce boxing/unboxing operations.

Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap:                  avgt  5  31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap:  avgt  5  15549.902 ±  771.647 ns/op



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
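For context, the map a dictionary encoder maintains simply assigns each distinct value a dictionary index, and the encoded column stores those indices. The Python sketch below is purely illustrative of that role; the boxing/unboxing cost the issue targets is specific to the Java implementation and is not modeled here:

--
#!/usr/bin/env python
# Conceptual sketch of dictionary encoding; not the Arrow Java DictionaryEncoder.
def dictionary_encode(values):
    value_to_index = {}   # the role played by the encoder's hash map
    dictionary = []       # distinct values, in first-seen order
    indices = []          # encoded representation of the input
    for v in values:
        idx = value_to_index.get(v)
        if idx is None:
            idx = len(dictionary)
            value_to_index[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return dictionary, indices


# Example usage:
print(dictionary_encode(["foo", "bar", "foo", "baz", "bar"]))
# (['foo', 'bar', 'baz'], [0, 1, 0, 2, 1])
--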
Re: [VOTE] Release Apache Arrow 0.14.0 - RC0
On Ubuntu 18.04:

- failed to verify binaries

"""
+ echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU
for details.'
Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details.
"""

There are no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of
zombie curl processes running...

- failed to verify sources

"""
+ export PATH
/tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
line 55: PS1: unbound variable
+ ask_conda=
+ return 1
+ cleanup
+ '[' no = yes ']'
+ echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X
for details.'
Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details.
"""

There are no details in /tmp/arrow-0.14.0.yum2X

Regards

Antoine.


Le 01/07/2019 à 07:32, Sutou Kouhei a écrit :
> Hi,
>
> I would like to propose the following release candidate (RC0) of Apache
> Arrow version 0.14.0. This is a release consisting of 618 resolved
> JIRA issues [1].
>
> This release candidate is based on commit:
> a591d76ad9a657110368aa422bb00f4010cb6b6e [2]
>
> The source release rc0 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7].
> The changelog is located at [8].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [9] for how to validate a release candidate.
>
> NOTE: You must use verify-release-candidate.sh at master.
> I've fixed some problems after the apache-arrow-0.14.0 tag.
> C#'s "sourcelink test" is fragile (a network-related problem?).
> It may be better to add retry logic to "sourcelink test".
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 0.14.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 0.14.0 because...
>
> [1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.0
> [2]: https://github.com/apache/arrow/tree/a591d76ad9a657110368aa422bb00f4010cb6b6e
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.0-rc0
> [4]: https://bintray.com/apache/arrow/centos-rc/0.14.0-rc0
> [5]: https://bintray.com/apache/arrow/debian-rc/0.14.0-rc0
> [6]: https://bintray.com/apache/arrow/python-rc/0.14.0-rc0
> [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.0-rc0
> [8]: https://github.com/apache/arrow/blob/a591d76ad9a657110368aa422bb00f4010cb6b6e/CHANGELOG.md
> [9]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>
>
> Thanks,
> --
> kou
>