Re: [C++] Runtime SIMD dispatching for Arrow

2020-05-12 Thread Micah Kornfield
>
> Since I develop on an AVX512-capable machine, if we have runtime
> dispatching then it should be able to test all variants of a function
> from a single executable / test run rather than having to produce
> multiple builds and test them separately, right?

Yes, but I think the same is true without runtime dispatching.  We might
have different mental models for runtime dispatching, so I'll put up a
concrete example.  If we want optimized code for "some_function" it would
look like:

#ifdef HAVE_AVX512
void some_function_512() {
...
}
#endif

void some_function_base() {
...
}

// static dispatching
void some_function() {
#ifdef HAVE_AVX512
some_function_512();
#else
some_function_base();
#endif
}

// dynamic dispatch
void some_function() {
   static void (*chosen_function)() = Choose(cpu_info, &some_function_512,
                                             &some_function_base);
   chosen_function();
}
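
For illustration, Choose here stands for a small helper that picks a function
pointer based on runtime CPU information.  A minimal sketch of that idea
(CpuInfo and its member are made-up placeholders, not an existing Arrow API):

struct CpuInfo {
  bool has_avx512;
};

using VoidFn = void (*)();

// Pick the widest SIMD variant the running CPU supports.
VoidFn Choose(const CpuInfo& cpu_info, VoidFn avx512_impl, VoidFn base_impl) {
  return cpu_info.has_avx512 ? avx512_impl : base_impl;
}

The real helper would presumably query the CPU once and cache the result.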

In both cases, we need to have tests which call into some_function_512()
and some_function_base().  With runtime dispatching, it is possible to
write the tests as something like:

for (CpuInfo info : all_supported_architectures) {
  TEST(Choose(info, &some_function_512, &some_function_base));
}

But I think there is likely something equivalent that we could do with
macro magic.
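
For instance, a value-parameterized GoogleTest suite could run the same
assertions against every compiled variant from a single binary.  This is only
a sketch (the fixture and stub bodies are made up, and in practice the AVX512
case would be skipped when the host CPU lacks support):

#include <gtest/gtest.h>

using VoidFn = void (*)();

void some_function_base() { /* scalar implementation */ }
void some_function_512() { /* AVX512 implementation, guarded in real code */ }

class SomeFunctionTest : public ::testing::TestWithParam<VoidFn> {};

TEST_P(SomeFunctionTest, ProducesExpectedResult) {
  VoidFn impl = GetParam();
  impl();  // a real test would compare the output against a reference
}

INSTANTIATE_TEST_SUITE_P(AllVariants, SomeFunctionTest,
                         ::testing::Values(&some_function_base,
                                           &some_function_512));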

Did you have something different in mind?

Micah





On Tue, May 12, 2020 at 8:31 PM Wes McKinney  wrote:

> On Tue, May 12, 2020 at 9:47 PM Yibo Cai  wrote:
> >
> > Thanks Wes, I'm glad to see this feature coming.
> >
> >  From history talks, the main concern is runtime dispatcher may cause
> performance issue.
> > Personally, I don't think it's a big problem. If we're using SIMD, it
> must be targeting some time consuming code.
> >
> > But we do need to take care some issues. E.g, I see code like this:
> > for (int i = 0; i < n; ++i) {
> >simd_code();
> > }
> > With runtime dispatcher, it becomes an indirect function call in each
> iteration.
> > We should change the code to move the loop inside simd_code().
>
> To be clear, I'm referring to SIMD-optimized code that operates on
> batches of data. The overhead of choosing an implementation based on a
> global settings object should not be meaningful. If there is
> performance-sensitive code at inline call sites then I agree that it
> is an issue. I don't think that characterizes most of the anticipated
> work in Arrow, though, since functions generally will process a
> chunk/array of data at time (see, e.g. Parquet encoding/decoding work
> recently).
>
> > It would be better if you can consider architectures other than x86(at
> framework level).
> > Ignore it if it costs much effort. We can always improve later.
> >
> > Yibo
> >
> > On 5/13/20 9:46 AM, Wes McKinney wrote:
> > > hi,
> > >
> > > We've started to receive a number of patches providing SIMD operations
> > > for both x86 and ARM architectures. Most of these patches make use of
> > > compiler definitions to toggle between code paths at compile time.
> > >
> > > This is problematic for a few reasons:
> > >
> > > * Binaries that are shipped (e.g. in Python) must generally be
> > > compiled for a broad set of supported compilers. That means that AVX2
> > > / AVX512 optimizations won't be available in these builds for
> > > processors that have them
> > > * Poses a maintainability and testing problem (hard to test every
> > > combination, and it is not practical for local development to compile
> > > every combination, which may cause drawn out test/CI/fix cycles)
> > >
> > > Other projects (e.g. NumPy) have taken the approach of building
> > > binaries that contain multiple variants of a function with different
> > > levels of SIMD, and then choosing at runtime which one to execute
> > > based on what features the CPU supports. This seems like what we
> > > ultimately need to do in Apache Arrow, and if we continue to accept
> > > patches that do not do this, it will be much more work later when we
> > > have to refactor things to runtime dispatching.
> > >
> > > We have some PRs in the queue related to SIMD. Without taking a heavy
> > > handed approach like starting to veto PRs, how would everyone like to
> > > begin to address the runtime dispatching problem?
> > >
> > > Note that the Kernels revamp project I am working on right now will
> > > also facilitate runtime SIMD kernel dispatching for array expression
> > > evaluation.
> > >
> > > Thanks,
> > > Wes
> > >
>


Re: [C++] Runtime SIMD dispatching for Arrow

2020-05-12 Thread Wes McKinney
On Tue, May 12, 2020 at 9:47 PM Yibo Cai  wrote:
>
> Thanks Wes, I'm glad to see this feature coming.
>
>  From history talks, the main concern is runtime dispatcher may cause 
> performance issue.
> Personally, I don't think it's a big problem. If we're using SIMD, it must be 
> targeting some time consuming code.
>
> But we do need to take care some issues. E.g, I see code like this:
> for (int i = 0; i < n; ++i) {
>simd_code();
> }
> With runtime dispatcher, it becomes an indirect function call in each 
> iteration.
> We should change the code to move the loop inside simd_code().

To be clear, I'm referring to SIMD-optimized code that operates on
batches of data. The overhead of choosing an implementation based on a
global settings object should not be meaningful. If there is
performance-sensitive code at inline call sites then I agree that it
is an issue. I don't think that characterizes most of the anticipated
work in Arrow, though, since functions generally will process a
chunk/array of data at a time (see, e.g., the recent Parquet
encoding/decoding work).

> It would be better if you can consider architectures other than x86(at 
> framework level).
> Ignore it if it costs much effort. We can always improve later.
>
> Yibo
>
> On 5/13/20 9:46 AM, Wes McKinney wrote:
> > hi,
> >
> > We've started to receive a number of patches providing SIMD operations
> > for both x86 and ARM architectures. Most of these patches make use of
> > compiler definitions to toggle between code paths at compile time.
> >
> > This is problematic for a few reasons:
> >
> > * Binaries that are shipped (e.g. in Python) must generally be
> > compiled for a broad set of supported compilers. That means that AVX2
> > / AVX512 optimizations won't be available in these builds for
> > processors that have them
> > * Poses a maintainability and testing problem (hard to test every
> > combination, and it is not practical for local development to compile
> > every combination, which may cause drawn out test/CI/fix cycles)
> >
> > Other projects (e.g. NumPy) have taken the approach of building
> > binaries that contain multiple variants of a function with different
> > levels of SIMD, and then choosing at runtime which one to execute
> > based on what features the CPU supports. This seems like what we
> > ultimately need to do in Apache Arrow, and if we continue to accept
> > patches that do not do this, it will be much more work later when we
> > have to refactor things to runtime dispatching.
> >
> > We have some PRs in the queue related to SIMD. Without taking a heavy
> > handed approach like starting to veto PRs, how would everyone like to
> > begin to address the runtime dispatching problem?
> >
> > Note that the Kernels revamp project I am working on right now will
> > also facilitate runtime SIMD kernel dispatching for array expression
> > evaluation.
> >
> > Thanks,
> > Wes
> >


Re: [C++] Runtime SIMD dispatching for Arrow

2020-05-12 Thread Wes McKinney
On Tue, May 12, 2020 at 10:19 PM Micah Kornfield  wrote:
>
> Hi Wes,
> I think you highlighted the two issues well, but I think they are somewhat
> orthogonal and runtime dispatching only addresses the binary availability
> of the optimizations (but actually makes testing harder because it can
> potentially hide untested code paths).

Since I develop on an AVX512-capable machine, if we have runtime
dispatching then it should be possible to test all variants of a function
from a single executable / test run rather than having to produce
multiple builds and test them separately, right?

Presumably the SIMD-level at runtime would be configurable, so you
could either let it be automatically selected based on your CPU
capabilities or set manually (e.g. if you want to do perf testing with
SIMD vs. no SIMD at runtime).
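
Roughly, the manual override could look something like the sketch below (the
environment variable name, the enum, and DetectedSimdLevel are hypothetical
placeholders rather than an agreed-upon interface):

#include <cstdlib>
#include <string>

enum class SimdLevel { kNone, kAvx2, kAvx512 };

// Stub standing in for cpuid-style hardware detection.
SimdLevel DetectedSimdLevel() { return SimdLevel::kNone; }

SimdLevel EffectiveSimdLevel() {
  // Hypothetical manual override, e.g. for perf testing with SIMD disabled.
  const char* env = std::getenv("ARROW_SIMD_LEVEL");
  if (env != nullptr) {
    const std::string value(env);
    if (value == "NONE") return SimdLevel::kNone;
    if (value == "AVX2") return SimdLevel::kAvx2;
    if (value == "AVX512") return SimdLevel::kAvx512;
  }
  return DetectedSimdLevel();
}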

> Personally, I think it is valuable to have SIMD optimization in the code
> base even if our binaries aren't shipped with them as long as we have
> sufficient regression testing.
>
> For testability, I think there are two issues:
> A.  Resources available to test architecture specific code -  To solve this
> issue I think we choose a "latest" architecture to target.  Community
> members that want to target a more modern architecture than the community
> agreed upon architecture  would have the onus to augment testing resources
> with that architecture.  The recent Big-Endian CI coverage is a good
> example of this.  I don't think it is heavy handed to reject PRs if we
> don't have sufficient CI coverage.
>
> B.  Ensuring we have a sufficient test coverage for all code paths.  I
> think this breaks down into how we structure our code.  I know I've
> submitted a recent PR that makes it difficult to test each path separately,
> I will try to address this before submission.  Note, that that structuring
> the code so that each path can be tested independently is a precursor to
> runtime dispatch.  Once we agree on a "latest" architecture, if the code is
> structured appropriately, we should get sufficient code coverage by
> targeting the community decided "latest" architecture for most builds (and
> not having to do a full matrix of architectural changes).
>
> Thanks,
> Micah
>
>
>
>
>
>
> On Tue, May 12, 2020 at 6:47 PM Wes McKinney  wrote:
>
> > hi,
> >
> > We've started to receive a number of patches providing SIMD operations
> > for both x86 and ARM architectures. Most of these patches make use of
> > compiler definitions to toggle between code paths at compile time.
> >
> > This is problematic for a few reasons:
> >
> > * Binaries that are shipped (e.g. in Python) must generally be
> > compiled for a broad set of supported compilers. That means that AVX2
> > / AVX512 optimizations won't be available in these builds for
> > processors that have them
> > * Poses a maintainability and testing problem (hard to test every
> > combination, and it is not practical for local development to compile
> > every combination, which may cause drawn out test/CI/fix cycles)
> >
> > Other projects (e.g. NumPy) have taken the approach of building
> > binaries that contain multiple variants of a function with different
> > levels of SIMD, and then choosing at runtime which one to execute
> > based on what features the CPU supports. This seems like what we
> > ultimately need to do in Apache Arrow, and if we continue to accept
> > patches that do not do this, it will be much more work later when we
> > have to refactor things to runtime dispatching.
> >
> > We have some PRs in the queue related to SIMD. Without taking a heavy
> > handed approach like starting to veto PRs, how would everyone like to
> > begin to address the runtime dispatching problem?
> >
> > Note that the Kernels revamp project I am working on right now will
> > also facilitate runtime SIMD kernel dispatching for array expression
> > evaluation.
> >
> > Thanks,
> > Wes
> >


Re: [C++] Runtime SIMD dispatching for Arrow

2020-05-12 Thread Micah Kornfield
Hi Wes,
I think you highlighted the two issues well, but I think they are somewhat
orthogonal and runtime dispatching only addresses the binary availability
of the optimizations (but actually makes testing harder because it can
potentially hide untested code paths).

Personally, I think it is valuable to have SIMD optimization in the code
base even if our binaries aren't shipped with them as long as we have
sufficient regression testing.

For testability, I think there are two issues:
A.  Resources available to test architecture-specific code - To solve this
issue, I think we choose a "latest" architecture to target.  Community
members that want to target a more modern architecture than the
community-agreed-upon one would have the onus to augment testing resources
with that architecture.  The recent Big-Endian CI coverage is a good
example of this.  I don't think it is heavy-handed to reject PRs if we
don't have sufficient CI coverage.

B.  Ensuring we have sufficient test coverage for all code paths.  I
think this breaks down into how we structure our code.  I know I've
submitted a recent PR that makes it difficult to test each path separately;
I will try to address this before submission.  Note that structuring
the code so that each path can be tested independently is a precursor to
runtime dispatch.  Once we agree on a "latest" architecture, if the code is
structured appropriately, we should get sufficient code coverage by
targeting the community-decided "latest" architecture for most builds (and
not having to do a full matrix of architectural changes).
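
As a concrete sketch of what that structure could look like, each variant
could be declared in an internal header and built in its own translation unit
with the matching compiler flags, so that tests can call every path directly
(the file and function names below are illustrative only):

// some_function_internal.h
#pragma once
#include <cstdint>

// Each variant lives in its own .cc file so it can be compiled with the
// matching flags (e.g. -mavx512f for the AVX512 file) and exercised directly
// from tests, independently of the runtime dispatcher.
void SomeFunctionBase(const int32_t* input, int32_t* output, int64_t length);
void SomeFunctionAvx512(const int32_t* input, int32_t* output, int64_t length);

// Public entry point that selects one of the variants at runtime.
void SomeFunction(const int32_t* input, int32_t* output, int64_t length);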

Thanks,
Micah






On Tue, May 12, 2020 at 6:47 PM Wes McKinney  wrote:

> hi,
>
> We've started to receive a number of patches providing SIMD operations
> for both x86 and ARM architectures. Most of these patches make use of
> compiler definitions to toggle between code paths at compile time.
>
> This is problematic for a few reasons:
>
> * Binaries that are shipped (e.g. in Python) must generally be
> compiled for a broad set of supported compilers. That means that AVX2
> / AVX512 optimizations won't be available in these builds for
> processors that have them
> * Poses a maintainability and testing problem (hard to test every
> combination, and it is not practical for local development to compile
> every combination, which may cause drawn out test/CI/fix cycles)
>
> Other projects (e.g. NumPy) have taken the approach of building
> binaries that contain multiple variants of a function with different
> levels of SIMD, and then choosing at runtime which one to execute
> based on what features the CPU supports. This seems like what we
> ultimately need to do in Apache Arrow, and if we continue to accept
> patches that do not do this, it will be much more work later when we
> have to refactor things to runtime dispatching.
>
> We have some PRs in the queue related to SIMD. Without taking a heavy
> handed approach like starting to veto PRs, how would everyone like to
> begin to address the runtime dispatching problem?
>
> Note that the Kernels revamp project I am working on right now will
> also facilitate runtime SIMD kernel dispatching for array expression
> evaluation.
>
> Thanks,
> Wes
>


Re: [C++] Runtime SIMD dispatching for Arrow

2020-05-12 Thread Yibo Cai

Thanks Wes, I'm glad to see this feature coming.

From past discussions, the main concern is that a runtime dispatcher may cause
performance issues.
Personally, I don't think it's a big problem. If we're using SIMD, it must be
targeting some time-consuming code.

But we do need to take care of some issues. E.g., I see code like this:
for (int i = 0; i < n; ++i) {
  simd_code();
}
With a runtime dispatcher, this becomes an indirect function call in each iteration.
We should change the code to move the loop inside simd_code().
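
Something like this, as a rough sketch (simd_code and the batch variant are
placeholders for whatever kernel is being dispatched):

// Before: the dispatched (indirect) call happens once per element.
void simd_code();  // placeholder, selected through the runtime dispatcher

void process_before(int n) {
  for (int i = 0; i < n; ++i) {
    simd_code();  // n indirect calls
  }
}

// After: dispatch once and keep the hot loop inside the chosen variant.
void simd_code_batch(int n) {
  for (int i = 0; i < n; ++i) {
    // vectorized body for element i
  }
}

void process_after(int n) {
  simd_code_batch(n);  // a single dispatched call per batch
}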

It would be better if you could consider architectures other than x86
(at the framework level).
Ignore it if it costs too much effort. We can always improve later.

Yibo

On 5/13/20 9:46 AM, Wes McKinney wrote:

hi,

We've started to receive a number of patches providing SIMD operations
for both x86 and ARM architectures. Most of these patches make use of
compiler definitions to toggle between code paths at compile time.

This is problematic for a few reasons:

* Binaries that are shipped (e.g. in Python) must generally be
compiled for a broad set of supported compilers. That means that AVX2
/ AVX512 optimizations won't be available in these builds for
processors that have them
* Poses a maintainability and testing problem (hard to test every
combination, and it is not practical for local development to compile
every combination, which may cause drawn out test/CI/fix cycles)

Other projects (e.g. NumPy) have taken the approach of building
binaries that contain multiple variants of a function with different
levels of SIMD, and then choosing at runtime which one to execute
based on what features the CPU supports. This seems like what we
ultimately need to do in Apache Arrow, and if we continue to accept
patches that do not do this, it will be much more work later when we
have to refactor things to runtime dispatching.

We have some PRs in the queue related to SIMD. Without taking a heavy
handed approach like starting to veto PRs, how would everyone like to
begin to address the runtime dispatching problem?

Note that the Kernels revamp project I am working on right now will
also facilitate runtime SIMD kernel dispatching for array expression
evaluation.

Thanks,
Wes



RE: [C++] Runtime SIMD dispatching for Arrow

2020-05-12 Thread Du, Frank
Hi,

I totally agree that Arrow should have built-in support for runtime
dispatching facilities, just like other popular computing libraries, to fully
utilize modern hardware capacity. We feel Arrow has great potential for
performance gains with the advanced CPU SIMD features.

It's OK for me to stop the current SIMD PR; my only concern is how long it
will take for a basic runtime dispatching policy to be ready to leverage. Does
the kernel refactoring include runtime dispatching already?

Thanks,
Frank

-Original Message-
From: Wes McKinney  
Sent: Wednesday, May 13, 2020 9:46 AM
To: dev 
Subject: [C++] Runtime SIMD dispatching for Arrow

hi,

We've started to receive a number of patches providing SIMD operations for both 
x86 and ARM architectures. Most of these patches make use of compiler 
definitions to toggle between code paths at compile time.

This is problematic for a few reasons:

* Binaries that are shipped (e.g. in Python) must generally be compiled for a 
broad set of supported compilers. That means that AVX2 / AVX512 optimizations 
won't be available in these builds for processors that have them
* Poses a maintainability and testing problem (hard to test every combination, 
and it is not practical for local development to compile every combination, 
which may cause drawn out test/CI/fix cycles)

Other projects (e.g. NumPy) have taken the approach of building binaries that 
contain multiple variants of a function with different levels of SIMD, and then 
choosing at runtime which one to execute based on what features the CPU 
supports. This seems like what we ultimately need to do in Apache Arrow, and if 
we continue to accept patches that do not do this, it will be much more work 
later when we have to refactor things to runtime dispatching.

We have some PRs in the queue related to SIMD. Without taking a heavy handed 
approach like starting to veto PRs, how would everyone like to begin to address 
the runtime dispatching problem?

Note that the Kernels revamp project I am working on right now will also 
facilitate runtime SIMD kernel dispatching for array expression evaluation.

Thanks,
Wes


[C++] Runtime SIMD dispatching for Arrow

2020-05-12 Thread Wes McKinney
hi,

We've started to receive a number of patches providing SIMD operations
for both x86 and ARM architectures. Most of these patches make use of
compiler definitions to toggle between code paths at compile time.

This is problematic for a few reasons:

* Binaries that are shipped (e.g. in Python) must generally be
compiled for a broad set of supported processors. That means that AVX2
/ AVX512 optimizations won't be available in these builds for
processors that have them
* Poses a maintainability and testing problem (hard to test every
combination, and it is not practical for local development to compile
every combination, which may cause drawn out test/CI/fix cycles)

Other projects (e.g. NumPy) have taken the approach of building
binaries that contain multiple variants of a function with different
levels of SIMD, and then choosing at runtime which one to execute
based on what features the CPU supports. This seems like what we
ultimately need to do in Apache Arrow, and if we continue to accept
patches that do not do this, it will be much more work later when we
have to refactor things to runtime dispatching.
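
For the runtime-detection half, a minimal sketch of the idea using the
__builtin_cpu_supports builtin available in GCC and Clang (the kernel names,
signatures, and placeholder bodies are illustrative, not actual Arrow code):

#include <cstdint>

// The variants would normally live in separate translation units compiled
// with the matching -m flags; the non-scalar bodies here are placeholders.
void sum_scalar(const double* values, int64_t n, double* out) {
  double total = 0;
  for (int64_t i = 0; i < n; ++i) total += values[i];
  *out = total;
}
void sum_avx2(const double* values, int64_t n, double* out) {
  sum_scalar(values, n, out);  // placeholder body
}
void sum_avx512(const double* values, int64_t n, double* out) {
  sum_scalar(values, n, out);  // placeholder body
}

using SumFn = void (*)(const double*, int64_t, double*);

SumFn ChooseSum() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  if (__builtin_cpu_supports("avx512f")) return &sum_avx512;
  if (__builtin_cpu_supports("avx2")) return &sum_avx2;
#endif
  return &sum_scalar;
}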

We have some PRs in the queue related to SIMD. Without taking a heavy
handed approach like starting to veto PRs, how would everyone like to
begin to address the runtime dispatching problem?

Note that the Kernels revamp project I am working on right now will
also facilitate runtime SIMD kernel dispatching for array expression
evaluation.

Thanks,
Wes


Re: [DISCUSS] Need for Arrow 0.17.1 patch release (binary only?)

2020-05-12 Thread Krisztián Szűcs
Just pushed the maint-0.17.x branch - I had conflicts here and there.

Applied the following git commands:
https://gist.github.com/kszucs/ea75f09090a9ffdff07e51582af1f436

Submitted the crossbow packaging tasks to see that nothing is missing.
I'll start with cutting RC0 tomorrow.

On Mon, May 11, 2020 at 7:23 PM Krisztián Szűcs
 wrote:
>
> On Fri, May 8, 2020 at 9:58 PM Wes McKinney  wrote:
> >
> > From the release milestone
> > (https://issues.apache.org/jira/projects/ARROW/versions/12348202) it
> > looks like there are a few remaining issues
> >
> > ARROW-8684 -- there is a PR but we don't understand why it works. If
> > we can't figure it out but the patch fixes the broken wheel issue we
> > may have to accept it for now
> > ARROW-8741: in progress
> > ARROW-8726: No patch yet available (does this need to block? since
> > this is a bug in bleeding edge functionality)
> >
> > Who would be able to cut the patch release? Do you think it will be
> > possible to reduce the steps involved with producing this release to
> > ease the work required of the RM?
> I'll do it.
>
> The patch release affects all of the C++ based packages we cannot really skip
> any steps of creating a release candidate but we can spare a couple of post
> release steps to not ship unaffected implementations.
>
> Additionally we'll need to include multiple packaging related patches to have
> passing build - I'm going through the changelog.
>
> >
> > On Thu, May 7, 2020 at 7:30 AM Antoine Pitrou  wrote:
> > >
> > >
> > > There's also ARROW-8728 and ARROW-8704.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 07/05/2020 à 14:25, Francois Saint-Jacques a écrit :
> > > > I'll add https://issues.apache.org/jira/browse/ARROW-8726 to the list.
> > > >
> > > > On Tue, May 5, 2020 at 6:52 PM Wes McKinney  wrote:
> > > >>
> > > >> Sorry I haven't had enough coffee today.
> > > >>
> > > >> The patches that still need to be resolved AFAICT are ARROW-8684 and
> > > >> ARROW-8706 (AKA PARQUET-1857), so it will take a little while yet
> > > >>
> > > >> On Tue, May 5, 2020 at 5:18 PM Wes McKinney  
> > > >> wrote:
> > > >>>
> > > >>> I just added it to the milestone in JIRA. In general if you want to
> > > >>> add a patch to a patch release, you can go into JIRA and add the fix
> > > >>> version
> > > >>>
> > > >>> I think the most critical patches have all been merged now, is there a
> > > >>> reason to delay much longer in cutting a release? We might use the
> > > >>> opportunity to produce a slimmed down "Release Management for Patch
> > > >>> Releases" guide with some steps omitted (hopefully)
> > > >>>
> > > >>> On Tue, May 5, 2020 at 2:55 PM Paul Taylor  
> > > >>> wrote:
> > > 
> > >  Would it be possible to include the variant.hpp update
> > >   for nvcc in 0.17.1?
> > > 
> > >  Thanks,
> > > 
> > >  Paul
> > > 
> > >  On 5/4/20 4:17 PM, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > We have accumulated a few regressions
> > > >
> > > > ARROW-8657 https://github.com/apache/arrow/pull/7089
> > > > ARROW-8694 https://github.com/apache/arrow/pull/7103
> > > >
> > > > there may be a few others.
> > > >
> > > > I think we should try to make a "streamlined" patch release (after
> > > > surveying incoming bug reports for other serious regressions) if
> > > > possible focused on providing patched binaries to the impacted users
> > > > (in the above, this would be any user of the Parquet portion of the
> > > > C++ library). The hope would be to be able to trim down the work
> > > > required of a release manager in a normal major release in these
> > > > scenarios where we need to get out bugfixes sooner.
> > > >
> > > > Thoughts?
> > > >
> > > > Thanks
> > > > Wes


[jira] [Created] (ARROW-8779) [R] Unable to write Struct Layout to file (.arrow, .parquet)

2020-05-12 Thread Dominic Dennenmoser (Jira)
Dominic Dennenmoser created ARROW-8779:
--

 Summary: [R] Unable to write Struct Layout to file (.arrow, 
.parquet)
 Key: ARROW-8779
 URL: https://issues.apache.org/jira/browse/ARROW-8779
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 0.17.0, 0.16.0
Reporter: Dominic Dennenmoser


It seems there is no method implemented to write a StructArrow (within a 
TableArrow) to file. A common case would be list columns in a dataframe. If I 
have understood the documentation correctly, this should be realisable within 
the current C++ library framework.

I tested this with the following df structure:
{code:none}
df
|-- ID 
|-- Data 
||-- a 
||-- b 
||-- c 
||-- d {code}
 I got the following error message:
{code:none}
Error in Table__from_dots(dots, schema) : NotImplemented: Converting vector to 
arrow type struct, d: double> not implemented{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8778) [C++][gandiva] SelectionVector related test failed on big-endian platforms

2020-05-12 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created ARROW-8778:
---

 Summary: [C++][gandiva] SelectionVector related test failed on 
big-endian platforms
 Key: ARROW-8778
 URL: https://issues.apache.org/jira/browse/ARROW-8778
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Kazuaki Ishizaki


The following gandiva test failures occur on big-endian platforms:

{code}
...
[--] 11 tests from TestSelectionVector
[ RUN  ] TestSelectionVector.TestInt16Make
[   OK ] TestSelectionVector.TestInt16Make (0 ms)
[ RUN  ] TestSelectionVector.TestInt16MakeNegative
[   OK ] TestSelectionVector.TestInt16MakeNegative (1 ms)
[ RUN  ] TestSelectionVector.TestInt16Set
[   OK ] TestSelectionVector.TestInt16Set (0 ms)
[ RUN  ] TestSelectionVector.TestInt16PopulateFromBitMap
/arrow/cpp/src/gandiva/selection_vector_test.cc:116: Failure
Expected equality of these values:
  selection->GetIndex(0)
Which is: 56
  0
/arrow/cpp/src/gandiva/selection_vector_test.cc:117: Failure
Expected equality of these values:
  selection->GetIndex(1)
Which is: 61
  5
/arrow/cpp/src/gandiva/selection_vector_test.cc:118: Failure
Expected equality of these values:
  selection->GetIndex(2)
Which is: 65
  121
[  FAILED  ] TestSelectionVector.TestInt16PopulateFromBitMap (0 ms)
[ RUN  ] TestSelectionVector.TestInt16PopulateFromBitMapNegative
/arrow/cpp/src/gandiva/selection_vector_test.cc:137: Failure
Expected equality of these values:
  status.IsInvalid()
Which is: false
  true
[  FAILED  ] TestSelectionVector.TestInt16PopulateFromBitMapNegative (0 ms)
[ RUN  ] TestSelectionVector.TestInt32Set
[   OK ] TestSelectionVector.TestInt32Set (0 ms)
[ RUN  ] TestSelectionVector.TestInt32PopulateFromBitMap
/arrow/cpp/src/gandiva/selection_vector_test.cc:187: Failure
Expected equality of these values:
  selection->GetIndex(0)
Which is: 56
  0
/arrow/cpp/src/gandiva/selection_vector_test.cc:188: Failure
Expected equality of these values:
  selection->GetIndex(1)
Which is: 61
  5
/arrow/cpp/src/gandiva/selection_vector_test.cc:189: Failure
Expected equality of these values:
  selection->GetIndex(2)
Which is: 65
  121
[  FAILED  ] TestSelectionVector.TestInt32PopulateFromBitMap (0 ms)
[ RUN  ] TestSelectionVector.TestInt32MakeNegative
[   OK ] TestSelectionVector.TestInt32MakeNegative (0 ms)
[ RUN  ] TestSelectionVector.TestInt64Set
[   OK ] TestSelectionVector.TestInt64Set (0 ms)
[ RUN  ] TestSelectionVector.TestInt64PopulateFromBitMap
/arrow/cpp/src/gandiva/selection_vector_test.cc:252: Failure
Expected equality of these values:
  selection->GetIndex(0)
Which is: 56
  0
/arrow/cpp/src/gandiva/selection_vector_test.cc:253: Failure
Expected equality of these values:
  selection->GetIndex(1)
Which is: 61
  5
/arrow/cpp/src/gandiva/selection_vector_test.cc:254: Failure
Expected equality of these values:
  selection->GetIndex(2)
Which is: 65
  121
[  FAILED  ] TestSelectionVector.TestInt64PopulateFromBitMap (0 ms)
[ RUN  ] TestSelectionVector.TestInt64MakeNegative
[   OK ] TestSelectionVector.TestInt64MakeNegative (0 ms)
[--] 11 tests from TestSelectionVector (1 ms total)
[--] 2 tests from TestLruCache
[ RUN  ] TestLruCache.TestEvict
[   OK ] TestLruCache.TestEvict (0 ms)
[ RUN  ] TestLruCache.TestLruBehavior
[   OK ] TestLruCache.TestLruBehavior (0 ms)
[--] 2 tests from TestLruCache (0 ms total)
[--] 4 tests from TestToDateHolder
[ RUN  ] TestToDateHolder.TestSimpleDateTime
[   OK ] TestToDateHolder.TestSimpleDateTime (0 ms)
[ RUN  ] TestToDateHolder.TestSimpleDate
[   OK ] TestToDateHolder.TestSimpleDate (0 ms)
[ RUN  ] TestToDateHolder.TestSimpleDateTimeError
[   OK ] TestToDateHolder.TestSimpleDateTimeError (0 ms)
[ RUN  ] TestToDateHolder.TestSimpleDateTimeMakeError
[   OK ] TestToDateHolder.TestSimpleDateTimeMakeError (0 ms)
[--] 4 tests from TestToDateHolder (1 ms total)
[--] 3 tests from TestSimpleArena
[ RUN  ] TestSimpleArena.TestAlloc
[   OK ] TestSimpleArena.TestAlloc (0 ms)
[ RUN  ] TestSimpleArena.TestReset1
[   OK ] TestSimpleArena.TestReset1 (0 ms)
[ RUN  ] TestSimpleArena.TestReset2
[   OK ] TestSimpleArena.TestReset2 (0 ms)
[--] 3 tests from TestSimpleArena (0 ms total)
[--] 6 tests from TestLikeHolder
[ RUN  ] TestLikeHolder.TestMatchAny
[   OK ] TestLikeHolder.TestMatchAny (0 ms)
[ RUN  ] TestLikeHolder.TestMatchOne
[   OK ] TestLikeHolder.TestMatchOne (0 ms)
[ RUN  ] TestLikeHolder.TestPcreSpecial
[   OK ] TestLikeHolder.TestPcreSpecial (0 ms)
[ RUN  ] TestLikeHolder.TestRegexEscape
[   OK ] TestLikeHolder.TestRegexEscape (0 ms)
[ RUN  ] TestLikeHolder.TestDot
[   OK ] TestLikeHolder.TestDot (0 ms)
[ RUN  ] 

[jira] [Created] (ARROW-8777) [Rust] Parquet.rs does not support reading fixed-size binary fields.

2020-05-12 Thread Max Burke (Jira)
Max Burke created ARROW-8777:


 Summary: [Rust] Parquet.rs does not support reading fixed-size 
binary fields.
 Key: ARROW-8777
 URL: https://issues.apache.org/jira/browse/ARROW-8777
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Max Burke
Assignee: Max Burke






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8776) [FlightRPC][C++] Flight/C++ middleware don't receive headers on failed calls to Java servers

2020-05-12 Thread David Li (Jira)
David Li created ARROW-8776:
---

 Summary: [FlightRPC][C++] Flight/C++ middleware don't receive 
headers on failed calls to Java servers
 Key: ARROW-8776
 URL: https://issues.apache.org/jira/browse/ARROW-8776
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Affects Versions: 0.17.0
Reporter: David Li
Assignee: David Li
 Fix For: 1.0.0


For a failed call, gRPC/Java may consolidate headers with trailers, so 
Flight/C++ needs to check both headers and trailers to get any headers that may 
have been sent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8775) [C++][FlightRPC] Integration client doesn't run integration tests

2020-05-12 Thread David Li (Jira)
David Li created ARROW-8775:
---

 Summary: [C++][FlightRPC] Integration client doesn't run 
integration tests
 Key: ARROW-8775
 URL: https://issues.apache.org/jira/browse/ARROW-8775
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC, Integration
Reporter: David Li
Assignee: David Li


Looks like I rebased badly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[Rust] DataFusion threading model design discussion

2020-05-12 Thread Andy Grove
As part of ARROW-8774 [1], I have created a Google doc [2] where any
interested parties can collaborate on a design discussion, which I will
then document in the JIRA.

[1] https://issues.apache.org/jira/browse/ARROW-8774

[2]
https://docs.google.com/document/d/1_wc6diy3YrRgEIhVIGzrO5AK8yhwfjWlmKtGnvbsrrY/edit?usp=sharing

Thanks,

Andy.


Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-05-12-0

2020-05-12 Thread Neal Richardson
The r-as-cran failure is spurious; I'm working on a fix on
https://issues.apache.org/jira/browse/ARROW-8768.

Neal

On Tue, May 12, 2020 at 3:11 AM Crossbow  wrote:

>
> Arrow Build Report for Job nightly-2020-05-12-0
>
> All tasks:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0
>
> Failed Tasks:
> - test-conda-python-3.7-spark-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-python-3.7-spark-master
> - test-r-linux-as-cran:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-r-linux-as-cran
> - wheel-win-cp35m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-appveyor-wheel-win-cp35m
> - wheel-win-cp36m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-appveyor-wheel-win-cp36m
> - wheel-win-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-appveyor-wheel-win-cp37m
> - wheel-win-cp38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-appveyor-wheel-win-cp38
>
> Succeeded Tasks:
> - centos-6-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-centos-6-amd64
> - centos-7-aarch64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-centos-7-aarch64
> - centos-7-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-centos-7-amd64
> - centos-8-aarch64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-centos-8-aarch64
> - centos-8-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-centos-8-amd64
> - conda-linux-gcc-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-linux-gcc-py36
> - conda-linux-gcc-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-linux-gcc-py37
> - conda-linux-gcc-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-linux-gcc-py38
> - conda-osx-clang-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-osx-clang-py38
> - conda-win-vs2015-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-win-vs2015-py36
> - conda-win-vs2015-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-win-vs2015-py37
> - conda-win-vs2015-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-win-vs2015-py38
> - debian-buster-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-debian-buster-amd64
> - debian-buster-arm64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-debian-buster-arm64
> - debian-stretch-amd64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-debian-stretch-amd64
> - debian-stretch-arm64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-debian-stretch-arm64
> - gandiva-jar-osx:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-gandiva-jar-osx
> - gandiva-jar-xenial:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-gandiva-jar-xenial
> - homebrew-cpp:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-homebrew-cpp
> - homebrew-r-autobrew:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-homebrew-r-autobrew
> - nuget:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-nuget
> - test-conda-cpp-valgrind:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-cpp-valgrind
> - test-conda-cpp:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-cpp
> - test-conda-python-3.6-pandas-0.23:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-python-3.6-pandas-0.23
> - test-conda-python-3.6:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-python-3.6
> - 

[jira] [Created] (ARROW-8774) [Rust] [DataFusion] Improve threading model

2020-05-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-8774:
-

 Summary: [Rust] [DataFusion] Improve threading model
 Key: ARROW-8774
 URL: https://issues.apache.org/jira/browse/ARROW-8774
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 1.0.0


DataFusion currently spawns one thread per partition, and this results in poor 
performance if there are more partitions than available cores/threads. It would 
be better to have a thread pool that defaults to the number of available cores.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8773) pyarrow schema.empty_table() does not preserve nullability of fields

2020-05-12 Thread Al Taylor (Jira)
Al Taylor created ARROW-8773:


 Summary: pyarrow schema.empty_table() does not preserve 
nullability of fields
 Key: ARROW-8773
 URL: https://issues.apache.org/jira/browse/ARROW-8773
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.0
 Environment: linux, pyarrow 0.17.0 installed via pipenv
Reporter: Al Taylor


Introduced by PR: [https://github.com/apache/arrow/pull/2589]

 

When a field in a schema is marked as not-nullable, calling empty_table() on 
the schema returns a table with nullable fields.

 

reproduction
{code:java}
>>> import pyarrow as pa
>>> s = pa.schema([pa.field('a', pa.int64(), nullable=False),
...                pa.field('b', pa.int64())])

>>> s

a: int64 not null
b: int64

>>> e = s.empty_table()
>>> e

pyarrow.Table
a: int64
b: int64

>>> e.schema

a: int64
b: int64

>>> assert s == e.schema
Traceback (most recent call last):
  File "", line 1, in 
AssertionError

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-05-12-0

2020-05-12 Thread Crossbow


Arrow Build Report for Job nightly-2020-05-12-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0

Failed Tasks:
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-python-3.7-spark-master
- test-r-linux-as-cran:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-r-linux-as-cran
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-appveyor-wheel-win-cp35m
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-appveyor-wheel-win-cp36m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-appveyor-wheel-win-cp37m
- wheel-win-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-appveyor-wheel-win-cp38

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-centos-8-amd64
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-debian-stretch-arm64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-travis-homebrew-r-autobrew
- nuget:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-nuget
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-cpp
- test-conda-python-3.6-pandas-0.23:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-python-3.6-pandas-0.23
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-12-0-github-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 

[jira] [Created] (ARROW-8772) [C++] Expand SumKernel benchmark to more types

2020-05-12 Thread Frank Du (Jira)
Frank Du created ARROW-8772:
---

 Summary: [C++] Expand SumKernel benchmark to more types
 Key: ARROW-8772
 URL: https://issues.apache.org/jira/browse/ARROW-8772
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Frank Du
Assignee: Frank Du


Expand the SumKernel benchmark to cover more types: Float, Double, Int8, Int16, 
Int32, Int64.

Currently it only has an Int64 entry; broader coverage would be useful for 
further optimization work.
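
A sketch of how the expanded coverage could look with templated Google
Benchmark cases (the names and sizes are illustrative, not the actual Arrow
benchmark code):

{code:cpp}
// Sketch only: generic templated sum benchmark, one instantiation per type.
#include <benchmark/benchmark.h>

#include <cstdint>
#include <vector>

template <typename T>
static void BM_Sum(benchmark::State& state) {
  std::vector<T> values(state.range(0), static_cast<T>(1));
  for (auto _ : state) {
    T total = static_cast<T>(0);
    for (T v : values) total += v;
    benchmark::DoNotOptimize(total);
  }
  state.SetItemsProcessed(state.iterations() * values.size());
}

BENCHMARK_TEMPLATE(BM_Sum, float)->Arg(1 << 20);
BENCHMARK_TEMPLATE(BM_Sum, double)->Arg(1 << 20);
BENCHMARK_TEMPLATE(BM_Sum, int8_t)->Arg(1 << 20);
BENCHMARK_TEMPLATE(BM_Sum, int16_t)->Arg(1 << 20);
BENCHMARK_TEMPLATE(BM_Sum, int32_t)->Arg(1 << 20);
BENCHMARK_TEMPLATE(BM_Sum, int64_t)->Arg(1 << 20);

BENCHMARK_MAIN();
{code}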



--
This message was sent by Atlassian Jira
(v8.3.4#803005)