Re: [VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-01 Thread Wes McKinney
+1 (binding)

Thanks Kou for adding the missing signatures.

* I was able to verify the binaries after the signature fix. The Linux
package tests are very nice!
* I ran the following source verifications (on linux except where noted)
  * C++ (Ubuntu 19.04 and Windows, with patch
https://github.com/apache/arrow/pull/4770)
  * Python (UB19.04 / Windows)
  * Java
  * JS
  * Ruby
  * GLib
  * Go
  * Rust
  * Integration tests with Flight (with minor patch
https://github.com/apache/arrow/pull/4775)

I only had trouble with C#, and it may be environment specific.

On Mon, Jul 1, 2019 at 4:32 PM Sutou Kouhei  wrote:
>
> Hi,
>
> > but it failed with
> >
> > https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a
>
> Thanks for catching this.
> I failed to upload some files. I have now uploaded the missing files.
>
> I confirmed that there are no missing files with the
> following Ruby script:
>
> --
> #!/usr/bin/env ruby
>
> require "open-uri"
> require "json"
> require "English"
>
> ["debian", "ubuntu", "centos", "python"].each do |target|
>   json_path = "/tmp/#{target}-file-list.json"
>   unless File.exist?(json_path)
>     open("https://bintray.com/api/v1/packages/apache/arrow/#{target}-rc/versions/0.14.0-rc0/files") do |input|
>       File.open(json_path, "w") do |json|
>         IO.copy_stream(input, json)
>       end
>     end
>   end
>
>   source_paths = []
>   asc_paths = []
>   sha512_paths = []
>   JSON.parse(File.read(json_path)).each do |entry|
>     path = entry["path"]
>     case path
>     when /\.asc\z/
>       asc_paths << $PREMATCH
>     when /\.sha512\z/
>       sha512_paths << $PREMATCH
>     else
>       source_paths << path
>     end
>   end
>   pp([:no_asc, source_paths - asc_paths])
>   pp([:no_source_for_asc, asc_paths - source_paths])
>   pp([:no_sha512, source_paths - sha512_paths])
>   pp([:no_source_for_sha512, sha512_paths - source_paths])
> end
> --
>
> But this is a bit strange. The download file list is read from
> Bintray (*), so our verification script shouldn't be trying to
> download nonexistent files...
>
> (*) 
> https://bintray.com/api/v1/packages/apache/arrow/debian-rc/versions/0.14.0-rc0/files
>
> > I'm going to work on verifying more components. C# is failing with
> >
> > https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f
>
> I couldn't reproduce this in my environment.
> I'll try again with a clean environment.
>
> Note that we can run just the C# verification with the following
> command line:
>
>   TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1 
> dev/release/verify-release-candidate.sh source 0.14.0 0
>
> > Seems like we might need to find an
> > artifact staging solution that is not Bintray if API rate limits are
> > going to be a problem.
>
> I haven't gotten a response yet from the https://bintray.com/apache
> organization. I'll open an issue on INFRA JIRA.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [VOTE] Release Apache Arrow 0.14.0 - RC0" on Mon, 1 Jul 2019 11:48:50 
> -0500,
>   Wes McKinney  wrote:
>
> > hi Antoine, I'm not sure the origin of the conda.sh failure, have you
> > tried removing any bashrc stuff related to the Anaconda distribution
> > that you develop against?
> >
> > With the following patch I'm able to run the binary verification
> >
> > https://github.com/apache/arrow/pull/4768
> >
> > but it failed with
> >
> > https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a
> >
> > Indeed a sig is missing from bintray. I was able to get the parallel
> > build to run on my machine (but it failed when I piped stdin/stdout to
> > a file) but I also found a bad sig
> >
> > https://gist.github.com/wesm/2404d55e087cc3982d93e53c83df95d5
> >
> > I'm going to work on verifying more components. C# is failing with
> >
> > https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f
> >
> > but I don't think that should block the release (it would be nice if
> > it passed though)
> >
> > I'm going to work on the Windows verification script and see if I can
> > add Flight support to it
> >
> > All in all, it appears that an RC1 may be warranted unless the signature
> > issues can be remedied in RC0. Seems like we might need to find an
> > artifact staging solution that is not Bintray if API rate limits are
> > going to be a problem.
> >
> > - Wes
> >
> > On Mon, Jul 1, 2019 at 3:48 AM Antoine Pitrou  wrote:
> >>
> >>
> >> On Ubuntu 18.04:
> >>
> >> - failed to verify binaries
> >>
> >> """
> >> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU
> >> for details.'
> >> Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for 
> >> details.
> >> """
> >>
> >> There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of
> >> zombie curl processes running...
> >>
> >> - failed to verify sources
> >>
> >> """
> >> + export PATH
> >> /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
> >> line 55: PS1: unbound variable
> >> + ask_conda=
> >> + return 1
> >> + cleanup
> >> + '[' no = yes ']'
> >> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details.'

Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"

2019-07-01 Thread Micah Kornfield
Hi Wes,
Thanks for your response.  In regards to the protocol negotiation your
description of feature reporting (snipped below) is along the lines of what
I was thinking.  It might not be necessary for 1.0.0, but at some point
might become useful.


>  Note that we don't really have a mechanism for clients
> and servers to report to each other what features they support, so
> this could help with that for applications where it might matter.


Thanks,
Micah


On Mon, Jul 1, 2019 at 12:54 PM Wes McKinney  wrote:

> hi Micah,
>
> Sorry for the delay in feedback. I looked at the document and it seems
> like a reasonable perspective about forward- and
> backward-compatibility.
>
> It seems like the main thing you are proposing is to apply Semantic
> Versioning to Format and Library versions separately. That's an
> interesting idea, my thought had been to have a version number that is
> FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is
> more flexible in some ways, so let me clarify for others reading
>
> In what you are proposing, the next release would be:
>
> Format version: 1.0.0
> Library version: 1.0.0
>
> Suppose that 20 major versions down the road we stand at
>
> Format version: 1.5.0
> Library version: 20.0.0
>
> The minor version of the Format would indicate that there are
> additions, like new elements in the Type union, but otherwise backward
> and forward compatible. So the Minor version means "new things, but
> old clients will not be disrupted if those new things are not used".
> We've already been doing this since the V4 Format iteration but we
> have not had a way to signal that there may be new features. As a
> corollary to this, I wonder if we should create a dual version in the
> metadata
>
> PROTOCOL VERSION: (what is currently MetadataVersion, V2)
> FEATURE VERSION: not tracked at all
>
> So Minor version bumps in the format would trigger a bump in the
> FeatureVersion. Note that we don't really have a mechanism for clients
> and servers to report to each other what features they support, so
> this could help with that for applications where it might matter.
>
> Should backward/forward compatibility be disrupted in the future, then
> a change to the major version would be required. So in year 2025, say,
> we might decide that we want to do:
>
> Format version: 2.0.0
> Library version: 21.0.0
>
> The Format version would live in the project's Documentation, so the
> Apache releases are only the library version.
>
> Regarding your open questions:
>
> 1. Should we clean up "warts" on the specification, like redundant
> information
>
> I don't think it's necessary. So if Metadata V5 is Format Version
> 1.0.0 (currently we are V4, but we're discussing some possible
> non-forward compatible changes...) I think that's OK. None of these
> things are "hurting" anything
>
> 2. Do we need additional mechanisms for marking some features as
> experimental?
>
> Not sure, but I think this can be mostly addressed through
> documentation. Flight will still be experimental in 1.0.0, for
> example.
>
> 3. Do we need protocol negotiation mechanisms in Flight
>
> Could you explain what you mean? Are you thinking if there is some
> major revamp of the protocol and you need to switch between a "V1
> Flight Protocol" and a "V2 Flight Protocol"?
>
> - Wes
>
> On Thu, Jun 13, 2019 at 2:17 AM Micah Kornfield 
> wrote:
> >
> > Hi Everyone,
> > I think there might be some ideas that we still need to reach consensus
> on
> > for how the format and libraries evolve in a post-1.0.0 release world.
> >  Specifically, I think we need to agree on definitions for
> > backwards/forwards compatibility and its implications for versioning the
> > format.
> >
> > To this end I put some thoughts down in a Google Doc [1] for the purposes
> > of discussion.  Comments welcome.  I will start threads for any comments
> in
> > the document that seem to warrant further discussion, and once we reach
> > consensus I can create a patch to document what we decide on as part of
> the
> > specification.
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#
>


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-01 Thread Wes McKinney
Thanks for the references.

If we decided to make a change around this, we could call the first 4
bytes a stream continuation marker to make it slightly less ugly

* 0x: continue
* 0x: stop
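
A minimal sketch of the alignment arithmetic behind this idea, assuming the
marker is 0xFFFFFFFF (consistent with old readers seeing "a metadata length
of -1", as discussed below); the values and layout here are illustrative
only, not a final format:

--
import struct

ALIGNMENT = 8
MARKER = 0xFFFFFFFF        # illustrative "continue" value
metadata_len = 128         # example flatbuffer metadata size

# Current framing: the message starts 8-byte aligned and is prefixed by a
# 4-byte int32 length, so the flatbuffer metadata begins at offset 4.
old_prefix = struct.pack("<i", metadata_len)
assert len(old_prefix) % ALIGNMENT == 4

# With a 4-byte continuation marker ahead of the length, the prefix is 8
# bytes, so the flatbuffer metadata starts back on an 8-byte boundary.
new_prefix = struct.pack("<Ii", MARKER, metadata_len)
assert len(new_prefix) % ALIGNMENT == 0

# An old client reading the first 4 bytes as an int32 length sees -1.
assert struct.unpack("<i", new_prefix[:4])[0] == -1
--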

On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield  wrote:
>
> Hi Wes,
> I'm not an expert on this either, my inclination mostly comes from some 
> research I've done.  I think it is important to distinguish two cases:
> 1.  unaligned access at the processor instruction level
> 2.  undefined behavior
>
> From my reading unaligned access is fine on most modern architectures and it 
> seems the performance penalty has mostly been eliminated.
>
> Undefined behavior is a compiler/language concept.  The problem is the 
> compiler can choose to do anything in UB scenarios, not just the "obvious" 
> translation.  Specifically, the compiler is under no obligation to generate 
> the unaligned access instructions, and if it doesn't SEGVs ensue.  Two 
> examples, both of which relate to SIMD optimizations are linked below.
>
> I tend to be on the conservative side with this type of thing but if we have 
> experts on the ML who can offer a more informed opinion, I would love to
> hear it.
>
> [1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
>
> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney  wrote:
>>
>> The <0x> solution is downright ugly but I think
>> it's one of the only ways that achieves
>>
>> * backward compatibility (new clients can read old data)
>> * opt-in forward compatibility (if we want to go to the labor of doing
>> so, sort of dangerous)
>> * old clients receiving new data do not blow up (they will see a
>> metadata length of -1)
>>
>> NB 0x  would look like:
>>
>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
>> Out[13]: array([4294967295,128], dtype=uint32)
>>
>> In [14]: np.array([(2 << 32) - 1, 128],
>> dtype=np.uint32).view(np.int32)
>> Out[14]: array([ -1, 128], dtype=int32)
>>
>> In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8)
>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)
>>
>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
>>
>> Do you know in what circumstances unaligned reads from Flatbuffers
>> might cause an issue? I do not know enough about UB but my
>> understanding is that it causes issues on some specialized platforms
>> where for most modern x86-64 processors and compilers it is not really
>> an issue (though perhaps a performance issue)
>>
>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield  
>> wrote:
>> >
>> > At least on the read-side we can make this detectable by using something 
>> > like <0x> instead of int64_t.  On the write side we 
>> > would need some sort of default mode that we could flip on/off if we 
>> > wanted to maintain compatibility.
>> >
>> > I should say I think we should fix it.  Undefined behavior is unpaid debt 
>> > that might never be collected or might cause things to fail in difficult 
>> > to diagnose ways. And pre-1.0.0 is definitely the time.
>> >
>> > -Micah
>> >
>> > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney  wrote:
>> >>
>> >> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney  wrote:
>> >> >
>> >> > hi Micah,
>> >> >
>> >> > This is definitely unfortunate, I wish we had realized the potential
>> >> > implications of having the Flatbuffer message start on a 4-byte
>> >> > (rather than 8-byte) boundary. The cost of making such a change now
>> >> > would be pretty high since all readers and writers in all languages
>> >> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
>> >> > bump is the last opportunity we have to make a change like this, so we
>> >> > might as well discuss it now. Note that particular implementations
>> >> > could implement compatibility functions to handle the 4 to 8 byte
>> >> > change so that old clients can still be understood. We'd probably want
>> >> > to do this in C++, for example, since users would pretty quickly
>> >> > acquire a new pyarrow version in Spark applications while they are
>> >> > stuck on an old version of the Java libraries.
>> >>
>> >> NB such a backwards compatibility fix would not be forward-compatible,
>> >> so the PySpark users would need to use a pinned version of pyarrow
>> >> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
>> >>
>> >> >
>> >> > - Wes
>> >> >
>> >> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield  
>> >> > wrote:
>> >> > >
>> >> > > While working on trying to fix undefined behavior for unaligned memory
>> >> > > accesses [1], I ran into an issue with the IPC specification [2] which
>> >> > > prevents us from ever achieving zero-copy memory mapping and having 
>> >> > > aligned
>> >> > > accesses (i.e. clean UBSan runs).
>> >> > >
>> >> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned 
>> >> > > accesses.
>> >> > >
>> > > In the IPC format we align each message to 8-byte boundaries.

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-01 Thread Micah Kornfield
Hi Wes,
I'm not an expert on this either, my inclination mostly comes from some
research I've done.  I think it is important to distinguish two cases:
1.  unaligned access at the processor instruction level
2.  undefined behavior

>From my reading unaligned access is fine on most modern architectures and
it seems the performance penalty has mostly been eliminated.

Undefined behavior is a compiler/language concept.  The problem is the
compiler can choose to do anything in UB scenarios, not just the "obvious"
translation.  Specifically, the compiler is under no obligation to generate
the unaligned access instructions, and if it doesn't SEGVs ensue.  Two
examples, both of which relate to SIMD optimizations are linked below.

I tend to be on the conservative side with this type of thing but if we
have experts on the ML who can offer a more informed opinion, I would
love to hear it.

[1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney  wrote:

> The <0x> solution is downright ugly but I think
> it's one of the only ways that achieves
>
> * backward compatibility (new clients can read old data)
> * opt-in forward compatibility (if we want to go to the labor of doing
> so, sort of dangerous)
> * old clients receiving new data do not blow up (they will see a
> metadata length of -1)
>
> NB 0x  would look like:
>
> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> Out[13]: array([4294967295,128], dtype=uint32)
>
> In [14]: np.array([(2 << 32) - 1, 128],
> dtype=np.uint32).view(np.int32)
> Out[14]: array([ -1, 128], dtype=int32)
>
> In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8)
> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)
>
> Flatbuffers are 32-bit limited so we don't need all 64 bits.
>
> Do you know in what circumstances unaligned reads from Flatbuffers
> might cause an issue? I do not know enough about UB but my
> understanding is that it causes issues on some specialized platforms
> where for most modern x86-64 processors and compilers it is not really
> an issue (though perhaps a performance issue)
>
> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield 
> wrote:
> >
> > At least on the read-side we can make this detectable by using something
> like <0x> instead of int64_t.  On the write side we
> would need some sort of default mode that we could flip on/off if we wanted
> to maintain compatibility.
> >
> > I should say I think we should fix it.  Undefined behavior is unpaid
> debt that might never be collected or might cause things to fail in
> difficult to diagnose ways. And pre-1.0.0 is definitely the time.
> >
> > -Micah
> >
> > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney 
> wrote:
> >>
> >> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney 
> wrote:
> >> >
> >> > hi Micah,
> >> >
> >> > This is definitely unfortunate, I wish we had realized the potential
> >> > implications of having the Flatbuffer message start on a 4-byte
> >> > (rather than 8-byte) boundary. The cost of making such a change now
> >> > would be pretty high since all readers and writers in all languages
> >> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> >> > bump is the last opportunity we have to make a change like this, so we
> >> > might as well discuss it now. Note that particular implementations
> >> > could implement compatibility functions to handle the 4 to 8 byte
> >> > change so that old clients can still be understood. We'd probably want
> >> > to do this in C++, for example, since users would pretty quickly
> >> > acquire a new pyarrow version in Spark applications while they are
> >> > stuck on an old version of the Java libraries.
> >>
> >> NB such a backwards compatibility fix would not be forward-compatible,
> >> so the PySpark users would need to use a pinned version of pyarrow
> >> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> >>
> >> >
> >> > - Wes
> >> >
> >> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <
> emkornfi...@gmail.com> wrote:
> >> > >
> >> > > While working on trying to fix undefined behavior for unaligned
> memory
> >> > > accesses [1], I ran into an issue with the IPC specification [2]
> which
> >> > > prevents us from ever achieving zero-copy memory mapping and having
> aligned
> >> > > accesses (i.e. clean UBSan runs).
> >> > >
> >> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> accesses.
> >> > >
> >> > > In the IPC format we align each message to 8-byte boundaries.  We then
> >> > > write an int32_t integer to denote the size of the flatbuffer metadata,
> >> > > followed immediately by the flatbuffer metadata.  This means the
> >> > > flatbuffer metadata will never be 8-byte aligned.
> >> > >
> >> > > Do people care?  A simple fix would be to use int64_t instead of int32_t
> >> > > for length.  However, any fix essentially breaks all previous client
> >> > > library versions or incurs a memory copy.

Re: [VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-01 Thread Sutou Kouhei
Hi,

> but it failed with
> 
> https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a

Thanks for catching this.
I failed to upload some files. I have now uploaded the missing files.

I confirmed that there are no missing files with the
following Ruby script:

--
#!/usr/bin/env ruby

require "open-uri"
require "json"
require "English"

["debian", "ubuntu", "centos", "python"].each do |target|
  json_path = "/tmp/#{target}-file-list.json"
  unless File.exist?(json_path)
    open("https://bintray.com/api/v1/packages/apache/arrow/#{target}-rc/versions/0.14.0-rc0/files") do |input|
      File.open(json_path, "w") do |json|
        IO.copy_stream(input, json)
      end
    end
  end

  source_paths = []
  asc_paths = []
  sha512_paths = []
  JSON.parse(File.read(json_path)).each do |entry|
    path = entry["path"]
    case path
    when /\.asc\z/
      asc_paths << $PREMATCH
    when /\.sha512\z/
      sha512_paths << $PREMATCH
    else
      source_paths << path
    end
  end
  pp([:no_asc, source_paths - asc_paths])
  pp([:no_source_for_asc, asc_paths - source_paths])
  pp([:no_sha512, source_paths - sha512_paths])
  pp([:no_source_for_sha512, sha512_paths - source_paths])
end
--

But this is a bit strange. The download file list is read from
Bintray (*), so our verification script shouldn't be trying to
download nonexistent files...

(*) 
https://bintray.com/api/v1/packages/apache/arrow/debian-rc/versions/0.14.0-rc0/files

> I'm going to work on verifying more components. C# is failing with
> 
> https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f

I couldn't reproduce this in my environment.
I'll try again with a clean environment.

Note that we can run just the C# verification with the following
command line:

  TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1 
dev/release/verify-release-candidate.sh source 0.14.0 0

> Seems like we might need to find an
> artifact staging solution that is not Bintray if API rate limits are
> going to be a problem.

I haven't gotten a response yet from the https://bintray.com/apache
organization. I'll open an issue on INFRA JIRA.


Thanks,
--
kou

In 
  "Re: [VOTE] Release Apache Arrow 0.14.0 - RC0" on Mon, 1 Jul 2019 11:48:50 
-0500,
  Wes McKinney  wrote:

> hi Antoine, I'm not sure the origin of the conda.sh failure, have you
> tried removing any bashrc stuff related to the Anaconda distribution
> that you develop against?
> 
> With the following patch I'm able to run the binary verification
> 
> https://github.com/apache/arrow/pull/4768
> 
> but it failed with
> 
> https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a
> 
> Indeed a sig is missing from bintray. I was able to get the parallel
> build to run on my machine (but it failed when I piped stdin/stdout to
> a file) but I also found a bad sig
> 
> https://gist.github.com/wesm/2404d55e087cc3982d93e53c83df95d5
> 
> I'm going to work on verifying more components. C# is failing with
> 
> https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f
> 
> but I don't think that should block the release (it would be nice if
> it passed though)
> 
> I'm going to work on the Windows verification script and see if I can
> add Flight support to it
> 
> All in all, it appears that an RC1 may be warranted unless the signature
> issues can be remedied in RC0. Seems like we might need to find an
> artifact staging solution that is not Bintray if API rate limits are
> going to be a problem.
> 
> - Wes
> 
> On Mon, Jul 1, 2019 at 3:48 AM Antoine Pitrou  wrote:
>>
>>
>> On Ubuntu 18.04:
>>
>> - failed to verify binaries
>>
>> """
>> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU
>> for details.'
>> Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details.
>> """
>>
>> There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of
>> zombie curl processes running...
>>
>> - failed to verify sources
>>
>> """
>> + export PATH
>> /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
>> line 55: PS1: unbound variable
>> + ask_conda=
>> + return 1
>> + cleanup
>> + '[' no = yes ']'
>> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X
>> for details.'
>> Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details.
>> """
>>
>> There's no details in /tmp/arrow-0.14.0.yum2X
>>
>> Regards
>>
>> Antoine.
>>
>>
>>
>>
>>
>> On 01/07/2019 at 07:32, Sutou Kouhei wrote:
>> > Hi,
>> >
>> > I would like to propose the following release candidate (RC0) of Apache
>> > Arrow version 0.14.0. This is a release consisting of 618
>> > resolved JIRA issues[1].
>> >
>> > This release candidate is based on commit:
>> > a591d76ad9a657110368aa422bb00f4010cb6b6e [2]
>> >
>> > The source release rc0 is hosted at [3].
>> > The binary artifacts are hosted at [4][5][6][7].
>> > The changelog is located at [8].
>> >
>> > Please download, verify checksums and signatures, run the unit tests,
>> > and vote on the release. See [9] for how to validate a release candidate.

Re: [VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-01 Thread Sutou Kouhei
Hi,

Thanks for verifying this RC.

> - failed to verify binaries
> 
> """
> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU
> for details.'
> Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details.
> """
> 
> There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of
> zombie curl processes running...

It seems that one of the curl downloads failed.
Parallel downloads may be fragile.

https://github.com/apache/arrow/pull/4768 by Wes will solve
this situation. I've merged this.


> - failed to verify sources
> 
> """
> + export PATH
> /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
> line 55: PS1: unbound variable

https://github.com/apache/arrow/pull/4773 will solve this.
I added "set -u" to detect the use of undefined variables caused
by typos in
https://github.com/apache/arrow/commit/9a788dfc976035cabb0d4ab15f0f6fa306a5428d
.

It works well in my environment. But I understand that it's
not portable for a shell script that sources an external shell
script (. $MINICONDA/etc/profile.d/conda.sh).

I've removed "set -u" by
https://github.com/apache/arrow/commit/9145c1591aedbd141454cfc7b6aad5190c0fb30e
.


Thanks,
--
kou

In <03cca1d7-7f8f-c46c-2360-132cd300c...@python.org>
  "Re: [VOTE] Release Apache Arrow 0.14.0 - RC0" on Mon, 1 Jul 2019 10:48:18 
+0200,
  Antoine Pitrou  wrote:

> 
> On Ubuntu 18.04:
> 
> - failed to verify binaries
> 
> """
> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU
> for details.'
> Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details.
> """
> 
> There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of
> zombie curl processes running...
> 
> - failed to verify sources
> 
> """
> + export PATH
> /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
> line 55: PS1: unbound variable
> + ask_conda=
> + return 1
> + cleanup
> + '[' no = yes ']'
> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X
> for details.'
> Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details.
> """
> 
> There's no details in /tmp/arrow-0.14.0.yum2X
> 
> Regards
> 
> Antoine.
> 
> 
> 
> 
> 
> On 01/07/2019 at 07:32, Sutou Kouhei wrote:
>> Hi,
>> 
>> I would like to propose the following release candidate (RC0) of Apache
>> Arrow version 0.14.0. This is a release consisting of 618
>> resolved JIRA issues[1].
>> 
>> This release candidate is based on commit:
>> a591d76ad9a657110368aa422bb00f4010cb6b6e [2]
>> 
>> The source release rc0 is hosted at [3].
>> The binary artifacts are hosted at [4][5][6][7].
>> The changelog is located at [8].
>> 
>> Please download, verify checksums and signatures, run the unit tests,
>> and vote on the release. See [9] for how to validate a release candidate.
>> 
>> NOTE: You must use verify-release-candidate.sh at master.
>> I've fixed some problems after the apache-arrow-0.14.0 tag.
>> C#'s "sourcelink test" is fragile (a network-related problem?).
>> It may be better to add retry logic to "sourcelink test".
>> 
>> The vote will be open for at least 72 hours.
>> 
>> [ ] +1 Release this as Apache Arrow 0.14.0
>> [ ] +0
>> [ ] -1 Do not release this as Apache Arrow 0.14.0 because...
>> 
>> [1]: 
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.0
>> [2]: 
>> https://github.com/apache/arrow/tree/a591d76ad9a657110368aa422bb00f4010cb6b6e
>> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.0-rc0
>> [4]: https://bintray.com/apache/arrow/centos-rc/0.14.0-rc0
>> [5]: https://bintray.com/apache/arrow/debian-rc/0.14.0-rc0
>> [6]: https://bintray.com/apache/arrow/python-rc/0.14.0-rc0
>> [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.0-rc0
>> [8]: 
>> https://github.com/apache/arrow/blob/a591d76ad9a657110368aa422bb00f4010cb6b6e/CHANGELOG.md
>> [9]: 
>> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>> 
>> 
>> Thanks,
>> --
>> kou
>> 


Re: Spark and Arrow Flight

2019-07-01 Thread Wes McKinney
On Mon, Jul 1, 2019 at 3:50 PM David Li  wrote:
>
> I think I'd prefer #3 over overloading an existing call (#2).
>
> We've been thinking about a similar issue, where sometimes we want
> just the schema, but the service can't necessarily return the schema
> without fetching data - right now we return a sentinel value in
> GetFlightInfo, but a separate RPC would let us explicitly indicate an
> error.
>
> I might be missing something though - what happens between step 1 and
> 2 that makes the endpoints available? Would it make sense to use
> DoAction to cause the backend to "prepare" the endpoints, and have the
> result of that be an encoded schema? So then the flow would be
> DoAction -> GetFlightInfo -> DoGet.

I think it depends on the particular server/planner implementation. If
preparing a dataset is expensive (imagine loading a large dataset into
a distributed cache, then dropping it later), then it might be that
you have:

DoAction: Load/Prepare $DATASET

... clients access the dataset using GetFlightInfo with path $DATASET

DoAction: Drop $DATASET

In other cases GetFlightInfo might contain a SQL query and so having a
separate DoAction workflow is not needed
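
A rough client-side sketch of that flow; the client object and its
do_action/get_flight_info/do_get-style methods are hypothetical here, not a
specific Flight API:

--
# Hypothetical flow for a dataset whose preparation is expensive.
def read_prepared_dataset(client, dataset_path):
    client.do_action("Load", dataset_path)            # expensive prepare step
    try:
        info = client.get_flight_info(dataset_path)   # descriptor -> endpoints
        batches = []
        for endpoint in info.endpoints:
            for batch in client.do_get(endpoint.ticket):
                batches.append(batch)
        return batches
    finally:
        client.do_action("Drop", dataset_path)        # release server resources
--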

>
> Best,
> David
>
> On 7/1/19, Wes McKinney  wrote:
> > My inclination is either #2 or #3. #4 is an option of course, but I
> > like the more structured solution of explicitly requesting the schema
> > given a descriptor.
> >
> > In both cases, it's possible that schemas are sent twice, e.g. if you
> > call GetSchema and then later call GetFlightInfo and so you receive
> > the schema again. The schema is optional, so if it became a
> > performance problem then a particular server might return the schema
> > as null from GetFlightInfo.
> >
> > I think it's valid to want to make a single GetFlightInfo RPC request
> > that returns _both_ the schema and the query plan.
> >
> > Thoughts from others?
> >
> > On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau  wrote:
> >>
> >> My initial inclination is towards #3 but I'd be curious what others
> >> think.
> >> In the case of #3, I wonder if it makes sense to then pull the Schema off
> >> the GetFlightInfo response...
> >>
> >> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray  wrote:
> >>
> >> > Hi All,
> >> >
> >> > I have been working on building an arrow flight source for spark. The
> >> > goal
> >> > here is for Spark to be able to use a group of arrow flight endpoints
> >> > to
> >> > get a dataset pulled over to spark in parallel.
> >> >
> >> > I am unsure of the best model for the spark <-> flight conversation and
> >> > wanted to get your opinion on the best way to go.
> >> >
> >> > I am breaking up the query to flight from spark into 3 parts:
> >> > 1) get the schema using GetFlightInfo. This is needed to do further
> >> > lazy
> >> > operations in Spark
> >> > 2) get the endpoints by calling GetFlightInfo a 2nd time with a
> >> > different
> >> > argument. This returns the list of endpoints on the parallel flight
> >> > server.
> >> > The endpoints are not available till data is ready to be fetched, which
> >> > is
> >> > done after the schema but is needed before DoGet is called.
> >> > 3) call get stream on all endpoints from 2
> >> >
> >> > I think I have to do each step however I don't like having to call
> >> > getInfo
> >> > twice, it doesn't seem very elegant. I see a few options:
> >> > 1) live with calling GetFlightInfo twice and with a custom bytes cmd to
> >> > differentiate the purpose of each call
> >> > 2) add an argument to GetFlightInfo to tell it its being called only
> >> > for
> >> > the schema
> >> > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to return
> >> > just
> >> > the Schema in question
> >> > 4) use DoAction and wrap the expected FlightInfo in a Result
> >> >
> >> > I am aware that 4 is probably the least disruptive but I'm also not a
> >> > fan
> >> > as (to me) it implies performing an action on the server side.
> >> > Suggestions
> >> > 2 & 3 are larger changes and I am reluctant to do that unless there is
> >> > a
> >> > consensus here. None of them are great options and I am wondering what
> >> > everyone thinks the best approach might be? Particularly as I think this
> >> > is
> >> > likely to come up in more applications than just spark.
> >> >
> >> > Best,
> >> > Ryan
> >> >
> >


Tracking running threads to close prior to Arrow 1.0.0 release

2019-07-01 Thread Wes McKinney
I started a Google Document to try to assemble outstanding discussion
threads with links to the mailing list so we do not lose track of the
various items that are up in the air.

The document is not complete -- if you would like Edit access to the
document please request and I will add you. Feel free to comment also

https://docs.google.com/document/d/10QrrJRdgqk5D9RQrkxqwvj3hiuruy8A2jY0teQvXz3s/edit?usp=sharing

Thanks,
Wes


Re: Spark and Arrow Flight

2019-07-01 Thread David Li
I think I'd prefer #3 over overloading an existing call (#2).

We've been thinking about a similar issue, where sometimes we want
just the schema, but the service can't necessarily return the schema
without fetching data - right now we return a sentinel value in
GetFlightInfo, but a separate RPC would let us explicitly indicate an
error.

I might be missing something though - what happens between step 1 and
2 that makes the endpoints available? Would it make sense to use
DoAction to cause the backend to "prepare" the endpoints, and have the
result of that be an encoded schema? So then the flow would be
DoAction -> GetFlightInfo -> DoGet.

Best,
David

On 7/1/19, Wes McKinney  wrote:
> My inclination is either #2 or #3. #4 is an option of course, but I
> like the more structured solution of explicitly requesting the schema
> given a descriptor.
>
> In both cases, it's possible that schemas are sent twice, e.g. if you
> call GetSchema and then later call GetFlightInfo and so you receive
> the schema again. The schema is optional, so if it became a
> performance problem then a particular server might return the schema
> as null from GetFlightInfo.
>
> I think it's valid to want to make a single GetFlightInfo RPC request
> that returns _both_ the schema and the query plan.
>
> Thoughts from others?
>
> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau  wrote:
>>
>> My initial inclination is towards #3 but I'd be curious what others
>> think.
>> In the case of #3, I wonder if it makes sense to then pull the Schema off
>> the GetFlightInfo response...
>>
>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray  wrote:
>>
>> > Hi All,
>> >
>> > I have been working on building an arrow flight source for spark. The
>> > goal
>> > here is for Spark to be able to use a group of arrow flight endpoints
>> > to
>> > get a dataset pulled over to spark in parallel.
>> >
>> > I am unsure of the best model for the spark <-> flight conversation and
>> > wanted to get your opinion on the best way to go.
>> >
>> > I am breaking up the query to flight from spark into 3 parts:
>> > 1) get the schema using GetFlightInfo. This is needed to do further
>> > lazy
>> > operations in Spark
>> > 2) get the endpoints by calling GetFlightInfo a 2nd time with a
>> > different
>> > argument. This returns the list of endpoints on the parallel flight
>> > server.
>> > The endpoints are not available till data is ready to be fetched, which
>> > is
>> > done after the schema but is needed before DoGet is called.
>> > 3) call get stream on all endpoints from 2
>> >
>> > I think I have to do each step however I don't like having to call
>> > getInfo
>> > twice, it doesn't seem very elegant. I see a few options:
>> > 1) live with calling GetFlightInfo twice and with a custom bytes cmd to
>> > differentiate the purpose of each call
>> > 2) add an argument to GetFlightInfo to tell it its being called only
>> > for
>> > the schema
>> > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to return
>> > just
>> > the Schema in question
>> > 4) use DoAction and wrap the expected FlightInfo in a Result
>> >
>> > I am aware that 4 is probably the least disruptive but I'm also not a
>> > fan
>> > as (to me) it implies performing an action on the server side.
>> > Suggestions
>> > 2 & 3 are larger changes and I am reluctant to do that unless there is
>> > a
>> > consensus here. None of them are great options and I am wondering what
>> > everyone thinks the best approach might be? Particularly as I think this
>> > is
>> > likely to come up in more applications than just spark.
>> >
>> > Best,
>> > Ryan
>> >
>


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-01 Thread Wes McKinney
The <0x> solution is downright ugly but I think
it's one of the only ways that achieves

* backward compatibility (new clients can read old data)
* opt-in forward compatibility (if we want to go to the labor of doing
so, sort of dangerous)
* old clients receiving new data do not blow up (they will see a
metadata length of -1)

NB 0x  would look like:

In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
Out[13]: array([4294967295,128], dtype=uint32)

In [14]: np.array([(2 << 32) - 1, 128],
dtype=np.uint32).view(np.int32)
Out[14]: array([ -1, 128], dtype=int32)

In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8)
Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)

Flatbuffers are 32-bit limited so we don't need all 64 bits.

Do you know in what circumstances unaligned reads from Flatbuffers
might cause an issue? I do not know enough about UB but my
understanding is that it causes issues on some specialized platforms
where for most modern x86-64 processors and compilers it is not really
an issue (though perhaps a performance issue)

On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield  wrote:
>
> At least on the read-side we can make this detectable by using something like 
> <0x> instead of int64_t.  On the write side we would 
> need some sort of default mode that we could flip on/off if we wanted to 
> maintain compatibility.
>
> I should say I think we should fix it.  Undefined behavior is unpaid debt 
> that might never be collected or might cause things to fail in difficult to 
> diagnose ways. And pre-1.0.0 is definitely the time.
>
> -Micah
>
> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney  wrote:
>>
>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney  wrote:
>> >
>> > hi Micah,
>> >
>> > This is definitely unfortunate, I wish we had realized the potential
>> > implications of having the Flatbuffer message start on a 4-byte
>> > (rather than 8-byte) boundary. The cost of making such a change now
>> > would be pretty high since all readers and writers in all languages
>> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
>> > bump is the last opportunity we have to make a change like this, so we
>> > might as well discuss it now. Note that particular implementations
>> > could implement compatibility functions to handle the 4 to 8 byte
>> > change so that old clients can still be understood. We'd probably want
>> > to do this in C++, for example, since users would pretty quickly
>> > acquire a new pyarrow version in Spark applications while they are
>> > stuck on an old version of the Java libraries.
>>
>> NB such a backwards compatibility fix would not be forward-compatible,
>> so the PySpark users would need to use a pinned version of pyarrow
>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
>>
>> >
>> > - Wes
>> >
>> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield  
>> > wrote:
>> > >
>> > > While working on trying to fix undefined behavior for unaligned memory
>> > > accesses [1], I ran into an issue with the IPC specification [2] which
>> > > prevents us from ever achieving zero-copy memory mapping and having 
>> > > aligned
>> > > accesses (i.e. clean UBSan runs).
>> > >
>> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
>> > >
>> > > In the IPC format we align each message to 8-byte boundaries.  We then
>> > > write an int32_t integer to denote the size of the flatbuffer metadata,
>> > > followed immediately by the flatbuffer metadata.  This means the
>> > > flatbuffer metadata will never be 8-byte aligned.
>> > >
>> > > Do people care?  A simple fix would be to use int64_t instead of int32_t
>> > > for length.  However, any fix essentially breaks all previous client
>> > > library versions or incurs a memory copy.
>> > >
>> > > [1] https://github.com/apache/arrow/pull/4757
>> > > [2] https://arrow.apache.org/docs/ipc.html


Re: Spark and Arrow Flight

2019-07-01 Thread Wes McKinney
My inclination is either #2 or #3. #4 is an option of course, but I
like the more structured solution of explicitly requesting the schema
given a descriptor.

In both cases, it's possible that schemas are sent twice, e.g. if you
call GetSchema and then later call GetFlightInfo and so you receive
the schema again. The schema is optional, so if it became a
performance problem then a particular server might return the schema
as null from GetFlightInfo.

I think it's valid to want to make a single GetFlightInfo RPC request
that returns _both_ the schema and the query plan.

Thoughts from others?
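
For reference, a sketch of what option #3 might look like from the client
side; the get_schema call is hypothetical, since no such RPC exists yet:

--
# Hypothetical two-step flow with a dedicated GetSchema RPC (option #3).
def plan_then_fetch(client, descriptor):
    schema = client.get_schema(descriptor)       # lightweight: schema only
    # ... lazy planning happens here, using only the schema ...
    info = client.get_flight_info(descriptor)    # later: endpoints (and
                                                 # possibly the schema again)
    return [client.do_get(ep.ticket) for ep in info.endpoints]
--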

On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau  wrote:
>
> My initial inclination is towards #3 but I'd be curious what others think.
> In the case of #3, I wonder if it makes sense to then pull the Schema off
> the GetFlightInfo response...
>
> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray  wrote:
>
> > Hi All,
> >
> > I have been working on building an arrow flight source for spark. The goal
> > here is for Spark to be able to use a group of arrow flight endpoints to
> > get a dataset pulled over to spark in parallel.
> >
> > I am unsure of the best model for the spark <-> flight conversation and
> > wanted to get your opinion on the best way to go.
> >
> > I am breaking up the query to flight from spark into 3 parts:
> > 1) get the schema using GetFlightInfo. This is needed to do further lazy
> > operations in Spark
> > 2) get the endpoints by calling GetFlightInfo a 2nd time with a different
> > argument. This returns the list of endpoints on the parallel flight server.
> > The endpoints are not available till data is ready to be fetched, which is
> > done after the schema but is needed before DoGet is called.
> > 3) call get stream on all endpoints from 2
> >
> > I think I have to do each step however I don't like having to call getInfo
> > twice, it doesn't seem very elegant. I see a few options:
> > 1) live with calling GetFlightInfo twice and with a custom bytes cmd to
> > differentiate the purpose of each call
> > 2) add an argument to GetFlightInfo to tell it its being called only for
> > the schema
> > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to return just
> > the Schema in question
> > 4) use DoAction and wrap the expected FlightInfo in a Result
> >
> > I am aware that 4 is probably the least disruptive but I'm also not a fan
> > as (to me) it implies performing an action on the server side. Suggestions
> > 2 & 3 are larger changes and I am reluctant to do that unless there is a
> > consensus here. None of them are great options and I am wondering what
> > everyone thinks the best approach might be? Particularly as I think this is
> > likely to come up in more applications than just spark.
> >
> > Best,
> > Ryan
> >


[jira] [Created] (ARROW-5820) [Release] Remove undefined variable check from verify script

2019-07-01 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5820:
---

 Summary: [Release] Remove undefined variable check from verify 
script
 Key: ARROW-5820
 URL: https://issues.apache.org/jira/browse/ARROW-5820
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei
 Fix For: 0.14.0


External shell scripts may refer to unbound variables:

{noformat}
/tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
line 55: PS1: unbound variable
{noformat}


https://lists.apache.org/thread.html/ebe8551eed2353b248b19084810ff454942b55470b9cf5837aa6cf79@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"

2019-07-01 Thread Wes McKinney
hi Micah,

Sorry for the delay in feedback. I looked at the document and it seems
like a reasonable perspective about forward- and
backward-compatibility.

It seems like the main thing you are proposing is to apply Semantic
Versioning to Format and Library versions separately. That's an
interesting idea, my thought had been to have a version number that is
FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is
more flexible in some ways, so let me clarify for others reading

In what you are proposing, the next release would be:

Format version: 1.0.0
Library version: 1.0.0

Suppose that 20 major versions down the road we stand at

Format version: 1.5.0
Library version: 20.0.0

The minor version of the Format would indicate that there are
additions, like new elements in the Type union, but otherwise backward
and forward compatible. So the Minor version means "new things, but
old clients will not be disrupted if those new things are not used".
We've already been doing this since the V4 Format iteration but we
have not had a way to signal that there may be new features. As a
corollary to this, I wonder if we should create a dual version in the
metadata

PROTOCOL VERSION: (what is currently MetadataVersion, V2)
FEATURE VERSION: not tracked at all

So Minor version bumps in the format would trigger a bump in the
FeatureVersion. Note that we don't really have a mechanism for clients
and servers to report to each other what features they support, so
this could help with that for applications where it might matter.
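
A small sketch of what such a check could look like on the read side; the
constants and fields are hypothetical, nothing like them exists in the
metadata today:

--
# Hypothetical reader-side check of the proposed dual version.
PROTOCOL_VERSION = 2   # roughly today's MetadataVersion
FEATURE_VERSION = 0    # would be bumped on each minor Format release

def check_peer(peer_protocol, peer_feature):
    if peer_protocol != PROTOCOL_VERSION:
        raise ValueError("incompatible protocol version: %d" % peer_protocol)
    if peer_feature > FEATURE_VERSION:
        # The peer may use newer, optional features; that is fine as long as
        # this particular stream does not actually contain them.
        print("peer feature version %d is newer than ours (%d)"
              % (peer_feature, FEATURE_VERSION))
--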

Should backward/forward compatibility be disrupted in the future, then
a change to the major version would be required. So in year 2025, say,
we might decide that we want to do:

Format version: 2.0.0
Library version: 21.0.0

The Format version would live in the project's Documentation, so the
Apache releases are only the library version.

Regarding your open questions:

1. Should we clean up "warts" on the specification, like redundant information

I don't think it's necessary. So if Metadata V5 is Format Version
1.0.0 (currently we are V4, but we're discussing some possible
non-forward compatible changes...) I think that's OK. None of these
things are "hurting" anything

2. Do we need additional mechanisms for marking some features as experimental?

Not sure, but I think this can be mostly addressed through
documentation. Flight will still be experimental in 1.0.0, for
example.

3. Do we need protocol negotiation mechanisms in Flight

Could you explain what you mean? Are you thinking if there is some
major revamp of the protocol and you need to switch between a "V1
Flight Protocol" and a "V2 Flight Protocol"?

- Wes

On Thu, Jun 13, 2019 at 2:17 AM Micah Kornfield  wrote:
>
> Hi Everyone,
> I think there might be some ideas that we still need to reach consensus on
> for how the format and libraries evolve in a post-1.0.0 release world.
>  Specifically, I think we need to agree on definitions for
> backwards/forwards compatibility and its implications for versioning the
> format.
>
> To this end I put some thoughts down in a Google Doc [1] for the purposes
> of discussion.  Comments welcome.  I will start threads for any comments in
> the document that seem to warrant further discussion, and once we reach
> consensus I can create a patch to document what we decide on as part of the
> specification.
>
> Thanks,
> Micah
>
> [1]
> https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#


Re: RecordBatch with Tensors/Arrays

2019-07-01 Thread Wes McKinney
hi Andrew,

I'm copying dev@ just so more folks are in the loop

On Wed, Jun 19, 2019 at 9:13 AM Andrew Spott  wrote:
>
> I was told to post this here, rather than as an issue on Github.
>
> 
>
> I'm looking to serialize data that looks something like this:
>
> ```
> record = { "predicted": ,
>   "truth": ,
>   "loss": ,
>   "index": }
>
> data = [
> pa.array([record, record, record]),
> pa.array([, , ])
> pa.array([, , ])
> ]
>
> batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
> ```
>
> But I'm not sure how to do that, or even if what I'm trying to do is the 
> right way to do it.

We don't support tensors/ndarrays as first-class value types in the
Python or C++ libraries. This could be done hypothetically using the
new ExtensionType facility. Tensor values would be embedded in an
Arrow Binary column.

There is already ARROW-1614 open for this. I also opened ARROW-5819
about implementing the Python-side plumbing around this

Another possible option is to infer list<...> types from ndarrays
(e.g. list<list<float64>> from an ndarray of ndim=2 and dtype=float64),
but this has not been implemented.
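
A sketch of the raw Binary-column embedding (without any ExtensionType
wrapper; shape and dtype bookkeeping is left out and would have to be
carried separately, e.g. in field metadata):

--
import numpy as np
import pyarrow as pa

# Store each ndarray's raw bytes in a Binary column of a RecordBatch.
tensors = [np.arange(6, dtype=np.float64).reshape(2, 3) for _ in range(3)]
predicted = pa.array([t.tobytes() for t in tensors], type=pa.binary())
loss = pa.array([0.1, 0.2, 0.3])
batch = pa.RecordBatch.from_arrays([predicted, loss], ["predicted", "loss"])
--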

>
> What is the difference between `pa.array` and `pa.list_`?  This formulation 
> is an array of structs, but is the struct of arrays formulation of this 
> possible? i.e.:
>

* The return value of pa.array is an Array object, which wraps the C++
arrow::Array type, the base class for value sequences. It's data, not
metadata
* pa.list_ returns an instance of ListType, which is a DataType
subclass. It's metadata, not data
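
A short illustration of that data-versus-metadata distinction:

--
import pyarrow as pa

values = pa.array([[1.0, 2.0], [3.0]])   # data: an Array (here a ListArray)
list_type = pa.list_(pa.float64())       # metadata: a ListType (a DataType)
assert values.type == list_type
--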

> ```
> data = [
> pa.array([ ,  ,  
> ]),
> pa.array([ ,  ,  
> ]),
> pa.array([, , ]),
> ...
> ]
> ```
>
> Which doesn't currently work.  It seems that there is a separation between 
> '1d arraylike' datatypes and 'pythonlike' datatypes (and 'nd arraylike' 
> datatypes), so I can't have a struct of an array.
>

Right. ndarrays as array cell values are not natively part of the
Arrow columnar format. But they could be supported through extensions.
This would be a nice project for someone to take on in the future

- Wes

> -Andrew


[jira] [Created] (ARROW-5819) [Python] Store sequences of arbitrary ndarrays (with same type) in Tensor value type

2019-07-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5819:
---

 Summary: [Python] Store sequences of arbitrary ndarrays (with same 
type) in Tensor value type
 Key: ARROW-5819
 URL: https://issues.apache.org/jira/browse/ARROW-5819
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


This can be implemented using extension types, based on outcome of ARROW-1614



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5818) [Java][Gandiva] support varlen output vectors

2019-07-01 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5818:
-

 Summary: [Java][Gandiva] support varlen output vectors
 Key: ARROW-5818
 URL: https://issues.apache.org/jira/browse/ARROW-5818
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-01 Thread Wes McKinney
The C++/Python source build looks fine to me on the Windows side -- I
added Flight support in

https://github.com/apache/arrow/pull/4770

I opened https://issues.apache.org/jira/browse/ARROW-5817 as there is
a risk that Flight Python tests might be silently skipped. We check in
our Python package builds that pyarrow.flight can be imported
successfully so I don't think those packages are at risk of having a
problem
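
The import check described above amounts to something like this:

--
import importlib

# Raises ImportError if the Flight extension is missing or fails to link.
importlib.import_module("pyarrow.flight")
--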

On Mon, Jul 1, 2019 at 11:48 AM Wes McKinney  wrote:
>
> hi Antoine, I'm not sure the origin of the conda.sh failure, have you
> tried removing any bashrc stuff related to the Anaconda distribution
> that you develop against?
>
> With the following patch I'm able to run the binary verification
>
> https://github.com/apache/arrow/pull/4768
>
> but it failed with
>
> https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a
>
> Indeed a sig is missing from bintray. I was able to get the parallel
> build to run on my machine (but it failed when I piped stdin/stdout to
> a file) but I also found a bad sig
>
> https://gist.github.com/wesm/2404d55e087cc3982d93e53c83df95d5
>
> I'm going to work on verifying more components. C# is failing with
>
> https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f
>
> but I don't think that should block the release (it would be nice if
> it passed though)
>
> I'm going to work on the Windows verification script and see if I can
> add Flight support to it
>
> All in all, it appears that an RC1 may be warranted unless the signature
> issues can be remedied in RC0. Seems like we might need to find an
> artifact staging solution that is not Bintray if API rate limits are
> going to be a problem.
>
> - Wes
>
> On Mon, Jul 1, 2019 at 3:48 AM Antoine Pitrou  wrote:
> >
> >
> > On Ubuntu 18.04:
> >
> > - failed to verify binaries
> >
> > """
> > + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU
> > for details.'
> > Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details.
> > """
> >
> > There's no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of
> > zombie curl processes running...
> >
> > - failed to verify sources
> >
> > """
> > + export PATH
> > /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
> > line 55: PS1: unbound variable
> > + ask_conda=
> > + return 1
> > + cleanup
> > + '[' no = yes ']'
> > + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X
> > for details.'
> > Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details.
> > """
> >
> > There's no details in /tmp/arrow-0.14.0.yum2X
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> >
> >
> > On 01/07/2019 at 07:32, Sutou Kouhei wrote:
> > > Hi,
> > >
> > > I would like to propose the following release candidate (RC0) of Apache
> > > Arrow version 0.14.0. This is a release consisting of 618
> > > resolved JIRA issues[1].
> > >
> > > This release candidate is based on commit:
> > > a591d76ad9a657110368aa422bb00f4010cb6b6e [2]
> > >
> > > The source release rc0 is hosted at [3].
> > > The binary artifacts are hosted at [4][5][6][7].
> > > The changelog is located at [8].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. See [9] for how to validate a release candidate.
> > >
> > > NOTE: You must use verify-release-candidate.sh at master.
> > > I've fixed some problems after the apache-arrow-0.14.0 tag.
> > > C#'s "sourcelink test" is fragile (a network-related problem?).
> > > It may be better to add retry logic to "sourcelink test".
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow 0.14.0
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow 0.14.0 because...
> > >
> > > [1]: 
> > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.0
> > > [2]: 
> > > https://github.com/apache/arrow/tree/a591d76ad9a657110368aa422bb00f4010cb6b6e
> > > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.0-rc0
> > > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.0-rc0
> > > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.0-rc0
> > > [6]: https://bintray.com/apache/arrow/python-rc/0.14.0-rc0
> > > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.0-rc0
> > > [8]: 
> > > https://github.com/apache/arrow/blob/a591d76ad9a657110368aa422bb00f4010cb6b6e/CHANGELOG.md
> > > [9]: 
> > > https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >


[jira] [Created] (ARROW-5817) [Python] Use pytest marks for Flight test to avoid silently skipping unit tests due to import failures

2019-07-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5817:
---

 Summary: [Python] Use pytest marks for Flight test to avoid 
silently skipping unit tests due to import failures
 Key: ARROW-5817
 URL: https://issues.apache.org/jira/browse/ARROW-5817
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.14.0


The approach used to determine whether Flight has been built will fail 
silently if the extension is built but importing it raises an ImportError 
due to linking or other issues:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_flight.py#L35

We should use the same "auto" approach as other optional components (see 
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/conftest.py#L40),
with an option for forced opt-in so that an ImportError does not cause the 
tests to be silently skipped: passing {{--flight}} should force the tests to 
run if we expect them to work.
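
For illustration, here is a minimal conftest.py sketch of the "auto detection 
plus forced opt-in" idea. This is not pyarrow's actual conftest; the option 
handling, helper, and marker name below are only placeholders:

--
# Sketch only: generic pytest pattern for an optional component.
# "--flight" forces the Flight tests to run; the default "auto"
# behavior skips them only when the import is unavailable.
import pytest


def pytest_addoption(parser):
    parser.addoption("--flight", action="store_true", default=False,
                     help="force Flight tests to run instead of auto-detecting")


def flight_importable():
    try:
        import pyarrow.flight  # noqa: F401
        return True
    except ImportError:
        return False


def pytest_collection_modifyitems(config, items):
    if config.getoption("--flight"):
        # Forced opt-in: never skip, so an ImportError surfaces as a
        # test failure instead of being silently swallowed.
        return
    if flight_importable():
        return
    skip_flight = pytest.mark.skip(reason="Flight not importable (auto mode)")
    for item in items:
        if "flight" in item.keywords:
            item.add_marker(skip_flight)
--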




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-01 Thread Wes McKinney
hi Antoine, I'm not sure of the origin of the conda.sh failure. Have you
tried removing any bashrc stuff related to the Anaconda distribution
that you develop against?

With the following patch I'm able to run the binary verification

https://github.com/apache/arrow/pull/4768

but it failed with

https://gist.github.com/wesm/711ae3d66c942db293dba55ff237871a

Indeed, a sig is missing from Bintray. I was able to get the parallel
build to run on my machine (though it failed when I piped stdin/stdout to
a file), but I also found a bad sig:

https://gist.github.com/wesm/2404d55e087cc3982d93e53c83df95d5

I'm going to work on verifying more components. C# is failing with

https://gist.github.com/wesm/985146df6944a1aade331c4bd1519f1f

but I don't think that should block the release (it would be nice if
it passed, though).

I'm going to work on the Windows verification script and see if I can
add Flight support to it.

All in all, it appears that an RC1 may be warranted unless the signature
issues can be remedied in RC0. Seems like we might need to find an
artifact staging solution that is not Bintray if API rate limits are
going to be a problem.

- Wes

On Mon, Jul 1, 2019 at 3:48 AM Antoine Pitrou  wrote:
>
>
> On Ubuntu 18.04:
>
> - failed to verify binaries
>
> """
> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU
> for details.'
> Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details.
> """
>
> There are no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of
> zombie curl processes running...
>
> - failed to verify sources
>
> """
> + export PATH
> /tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
> line 55: PS1: unbound variable
> + ask_conda=
> + return 1
> + cleanup
> + '[' no = yes ']'
> + echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X
> for details.'
> Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details.
> """
>
> There are no details in /tmp/arrow-0.14.0.yum2X
>
> Regards
>
> Antoine.
>
>
>
>
>
> Le 01/07/2019 à 07:32, Sutou Kouhei a écrit :
> > Hi,
> >
> > I would like to propose the following release candidate (RC0) of Apache
> > Arrow version 0.14.0. This is a release consisting of 618
> > resolved JIRA issues[1].
> >
> > This release candidate is based on commit:
> > a591d76ad9a657110368aa422bb00f4010cb6b6e [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7].
> > The changelog is located at [8].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [9] for how to validate a release candidate.
> >
> > NOTE: You must use verify-release-candidate.sh from master.
> > I've fixed some problems after the apache-arrow-0.14.0 tag.
> > C#'s "sourcelink test" is fragile. (A network-related problem?)
> > It may be better to add retry logic to "sourcelink test".
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 0.14.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 0.14.0 because...
> >
> > [1]: 
> > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.0
> > [2]: 
> > https://github.com/apache/arrow/tree/a591d76ad9a657110368aa422bb00f4010cb6b6e
> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.0-rc0
> > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.0-rc0
> > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.0-rc0
> > [6]: https://bintray.com/apache/arrow/python-rc/0.14.0-rc0
> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.0-rc0
> > [8]: 
> > https://github.com/apache/arrow/blob/a591d76ad9a657110368aa422bb00f4010cb6b6e/CHANGELOG.md
> > [9]: 
> > https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >
> >
> > Thanks,
> > --
> > kou
> >


[jira] [Created] (ARROW-5816) [Release] Parallel curl does not work reliably in verify-release-candidate.sh

2019-07-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5816:
---

 Summary: [Release] Parallel curl does not work reliably in 
verify-release-candidate.sh
 Key: ARROW-5816
 URL: https://issues.apache.org/jira/browse/ARROW-5816
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.14.0


The script can exit early without waiting for the background curl processes to finish.
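
For illustration, the fix presumably amounts to the usual "start the downloads 
in parallel, then block until every one has finished, and fail loudly if any of 
them failed" pattern. The verification script itself is Bash, so this Python 
sketch is only an analogy, with placeholder URLs:

--
# Sketch of "download in parallel, then wait for all downloads" in Python.
# The URLs and file names are placeholders, not the real artifact list.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

downloads = [
    ("https://example.invalid/apache-arrow-0.14.0.tar.gz",
     "apache-arrow-0.14.0.tar.gz"),
    ("https://example.invalid/apache-arrow-0.14.0.tar.gz.asc",
     "apache-arrow-0.14.0.tar.gz.asc"),
]


def fetch(entry):
    url, path = entry
    urlretrieve(url, path)  # raises on HTTP errors, so failures are not silent
    return path


with ThreadPoolExecutor(max_workers=8) as pool:
    # Leaving the "with" block waits for every worker; iterating the results
    # re-raises any exception from a failed download.
    for path in pool.map(fetch, downloads):
        print("fetched", path)
--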



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5815) [Java] Support swap functionality for fixed-width vectors

2019-07-01 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5815:
---

 Summary: [Java] Support swap functionality for fixed-width vectors
 Key: ARROW-5815
 URL: https://issues.apache.org/jira/browse/ARROW-5815
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Support swapping data elements for fixed-width vectors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5814) [Java] Implement a HashMap for DictionaryEncoder

2019-07-01 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5814:
-

 Summary: [Java] Implement a HashMap for DictionaryEncoder
 Key: ARROW-5814
 URL: https://issues.apache.org/jira/browse/ARROW-5814
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


As a follow-up to 
[ARROW-5726|https://issues.apache.org/jira/browse/ARROW-5726], implement a 
Map for DictionaryEncoder to reduce boxing/unboxing operations.

Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap: avgt  5  31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt  5  15549.902 ± 771.647 ns/op



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-01 Thread Antoine Pitrou


On Ubuntu 18.04:

- failed to verify binaries

"""
+ echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU
for details.'
Failed to verify release candidate. See /tmp/arrow-0.14.0.gucvU for details.
"""

There are no details in /tmp/arrow-0.14.0.gucvU. The script left a lot of
zombie curl processes running...

- failed to verify sources

"""
+ export PATH
/tmp/arrow-0.14.0.yum2X/apache-arrow-0.14.0/test-miniconda/etc/profile.d/conda.sh:
line 55: PS1: unbound variable
+ ask_conda=
+ return 1
+ cleanup
+ '[' no = yes ']'
+ echo 'Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X
for details.'
Failed to verify release candidate. See /tmp/arrow-0.14.0.yum2X for details.
"""

There are no details in /tmp/arrow-0.14.0.yum2X

Regards

Antoine.





Le 01/07/2019 à 07:32, Sutou Kouhei a écrit :
> Hi,
> 
> I would like to propose the following release candidate (RC0) of Apache
> Arrow version 0.14.0. This is a release consisting of 618
> resolved JIRA issues[1].
> 
> This release candidate is based on commit:
> a591d76ad9a657110368aa422bb00f4010cb6b6e [2]
> 
> The source release rc0 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7].
> The changelog is located at [8].
> 
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [9] for how to validate a release candidate.
> 
> NOTE: You must use verify-release-candidate.sh from master.
> I've fixed some problems after the apache-arrow-0.14.0 tag.
> C#'s "sourcelink test" is fragile. (A network-related problem?)
> It may be better to add retry logic to "sourcelink test".
> 
> The vote will be open for at least 72 hours.
> 
> [ ] +1 Release this as Apache Arrow 0.14.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 0.14.0 because...
> 
> [1]: 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.0
> [2]: 
> https://github.com/apache/arrow/tree/a591d76ad9a657110368aa422bb00f4010cb6b6e
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.0-rc0
> [4]: https://bintray.com/apache/arrow/centos-rc/0.14.0-rc0
> [5]: https://bintray.com/apache/arrow/debian-rc/0.14.0-rc0
> [6]: https://bintray.com/apache/arrow/python-rc/0.14.0-rc0
> [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.0-rc0
> [8]: 
> https://github.com/apache/arrow/blob/a591d76ad9a657110368aa422bb00f4010cb6b6e/CHANGELOG.md
> [9]: 
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> 
> 
> Thanks,
> --
> kou
>