[jira] [Created] (ARROW-9700) [Python] create_library_symlinks doesn't work in macos

2020-08-12 Thread Shawn Yang (Jira)
Shawn Yang created ARROW-9700:
-

 Summary: [Python] create_library_symlinks doesn't work in macos
 Key: ARROW-9700
 URL: https://issues.apache.org/jira/browse/ARROW-9700
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Shawn Yang


pyarrow.create_library_symlinks() doesn't create a symlink on macOS.

```

def get_symlink_path(hard_path):
  return '.'.join((hard_path.split('.')[0], 'dylib'))

```

should be changed to

```

def get_symlink_path(hard_path):
  splits = hard_path.split('.')
  splits.pop(-2)
  return '.'.join(splits)

```
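
To illustrate how the two versions can differ, here is a minimal sketch. The install path and the library file name are hypothetical, and the helper names `get_symlink_path_old` and `get_symlink_path_new` are introduced only for this comparison; the report itself does not say which path triggered the failure.

```python
def get_symlink_path_old(hard_path):
    # Current version: keeps only the text before the FIRST dot in the
    # whole path, then appends the 'dylib' extension.
    return '.'.join((hard_path.split('.')[0], 'dylib'))


def get_symlink_path_new(hard_path):
    # Proposed version: drops the second-to-last dot-separated component
    # (assumed here to be the version number) and keeps everything else.
    splits = hard_path.split('.')
    splits.pop(-2)
    return '.'.join(splits)


# Hypothetical path with a dot in a directory name ("python3.7").
hard_path = '/opt/python3.7/site-packages/pyarrow/libarrow.100.dylib'

old_result = get_symlink_path_old(hard_path)
new_result = get_symlink_path_new(hard_path)

print(old_result)  # /opt/python3.dylib
print(new_result)  # /opt/python3.7/site-packages/pyarrow/libarrow.dylib
```

With this input, the current version truncates the path at the first dot, while the proposed version only removes the version component from the file name. The proposed version still assumes a single version component before the extension.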



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9701) [Java][CI] Add a test job on s390x

2020-08-12 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created ARROW-9701:
---

 Summary: [Java][CI] Add a test job on s390x
 Key: ARROW-9701
 URL: https://issues.apache.org/jira/browse/ARROW-9701
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Kazuaki Ishizaki
Assignee: Kazuaki Ishizaki








[jira] [Updated] (ARROW-9700) [Python] create_library_symlinks doesn't work in macos

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9700:
--
Labels: pull-request-available  (was: )

> [Python] create_library_symlinks doesn't work in macos
> --
>
> Key: ARROW-9700
> URL: https://issues.apache.org/jira/browse/ARROW-9700
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Shawn Yang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> pyarrow.create_library_symlinks() doesn't create a symlink on macOS.
> ```
> def get_symlink_path(hard_path):
>   return '.'.join((hard_path.split('.')[0], 'dylib'))
> ```
> should be changed to
> ```
> def get_symlink_path(hard_path):
>   splits = hard_path.split('.')
>   splits.pop(-2)
>   return '.'.join(splits)
> ```





[jira] [Assigned] (ARROW-9701) [Java][CI] Add a test job on s390x

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9701:


Assignee: Kazuaki Ishizaki  (was: Apache Arrow JIRA Bot)

> [Java][CI] Add a test job on s390x
> --
>
> Key: ARROW-9701
> URL: https://issues.apache.org/jira/browse/ARROW-9701
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-9701) [Java][CI] Add a test job on s390x

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9701:


Assignee: Apache Arrow JIRA Bot  (was: Kazuaki Ishizaki)

> [Java][CI] Add a test job on s390x
> --
>
> Key: ARROW-9701
> URL: https://issues.apache.org/jira/browse/ARROW-9701
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9701) [Java][CI] Add a test job on s390x

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9701:
--
Labels: pull-request-available  (was: )

> [Java][CI] Add a test job on s390x
> --
>
> Key: ARROW-9701
> URL: https://issues.apache.org/jira/browse/ARROW-9701
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-9702) [C++] Move bpacking simd to runtime path

2020-08-12 Thread Frank Du (Jira)
Frank Du created ARROW-9702:
---

 Summary: [C++] Move bpacking simd to runtime path
 Key: ARROW-9702
 URL: https://issues.apache.org/jira/browse/ARROW-9702
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Frank Du
Assignee: Frank Du


Currently the unpack32 function has some statically compiled AVX512 SIMD code; it 
should be reworked to use the runtime path. It can also be implemented with AVX2.

 

The unpack32 API is used by PlainDecodingBoolean.
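
As a rough illustration of what choosing an implementation at run time can look like, here is a minimal sketch. It is not Arrow's code: the function names, the feature names, and the simplified bit-unpacking routine are all hypothetical stand-ins, and both "implementations" are plain Python.

```python
def unpack_bits_scalar(data, num_bits):
    # Portable fallback: extract one bit at a time, least-significant first.
    return [(data[i // 8] >> (i % 8)) & 1 for i in range(num_bits)]


def unpack_bits_bytewise(data, num_bits):
    # Stand-in for a "wider" implementation: expands a whole byte per step.
    out = []
    for byte in data:
        out.extend((byte >> shift) & 1 for shift in range(8))
    return out[:num_bits]


# Ordered from most to least preferred. The feature names are illustrative.
IMPLEMENTATIONS = [
    ("avx512", unpack_bits_bytewise),
    ("avx2", unpack_bits_bytewise),
    ("none", unpack_bits_scalar),
]


def select_unpack(cpu_features):
    # Pick the first implementation whose required feature is reported,
    # falling back to the portable version when nothing else matches.
    for feature, func in IMPLEMENTATIONS:
        if feature == "none" or feature in cpu_features:
            return feature, func
    raise RuntimeError("no implementation available")


# Hypothetical machine that reports AVX2 but not AVX512.
chosen, unpack = select_unpack({"avx2"})
bits = unpack(bytes([0b00000101]), 3)

# Hypothetical machine with no SIMD features reported.
fallback, unpack_fallback = select_unpack(set())
fallback_bits = unpack_fallback(bytes([0b00000101]), 3)
```

The point of the design is that a binary built once can run on CPUs without the newest instruction set, because the unsupported code path is never selected.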

 





[jira] [Updated] (ARROW-9698) [C++] Revert "Add -NDEBUG flag to arrow.pc"

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9698:
--
Labels: pull-request-available  (was: )

> [C++] Revert "Add -NDEBUG flag to arrow.pc"
> ---
>
> Key: ARROW-9698
> URL: https://issues.apache.org/jira/browse/ARROW-9698
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Brian Dunlay
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-2275 introduced a `#ifndef NDEBUG` check around a function which caused 
> the function to be omitted during release builds.
> As a workaround, ARROW-2313 added -DNDEBUG flags to the [pkg-config cmake 
> definition|https://github.com/apache/arrow/pull/1752] so that anyone using 
> the release build of the package would not run into any issues with the 
> missing code. As a result of this change, `pkg-config arrow --cflags` results 
> in -DNDEBUG being added as a compiler flag, forcing itself on the downstream 
> project whenever the dependency is located using pkg-config --.
> The original `#ifndef NDEBUG` change was 
> [reverted|https://github.com/apache/arrow/pull/1756] with ARROW-2316, but the 
> workaround in ARROW-2313 remains.
> I am proposing to revert the workaround in ARROW-2313 so that downstream 
> projects may link against the release build of arrow without adopting the 
> -DNDEBUG flag unnecessarily.





[jira] [Updated] (ARROW-9702) [C++] Move bpacking simd to runtime path

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9702:
--
Labels: pull-request-available  (was: )

> [C++] Move bpacking simd to runtime path
> 
>
> Key: ARROW-9702
> URL: https://issues.apache.org/jira/browse/ARROW-9702
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the unpack32 function has some statically compiled AVX512 SIMD code; it 
> should be reworked to use the runtime path. It can also be implemented with AVX2.
>  
> The unpack32 API is used by PlainDecodingBoolean.
>  





[jira] [Assigned] (ARROW-9702) [C++] Move bpacking simd to runtime path

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9702:


Assignee: Apache Arrow JIRA Bot  (was: Frank Du)

> [C++] Move bpacking simd to runtime path
> 
>
> Key: ARROW-9702
> URL: https://issues.apache.org/jira/browse/ARROW-9702
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the unpack32 function has some statically compiled AVX512 SIMD code; it 
> should be reworked to use the runtime path. It can also be implemented with AVX2.
>  
> The unpack32 API is used by PlainDecodingBoolean.
>  





[jira] [Assigned] (ARROW-9702) [C++] Move bpacking simd to runtime path

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9702:


Assignee: Frank Du  (was: Apache Arrow JIRA Bot)

> [C++] Move bpacking simd to runtime path
> 
>
> Key: ARROW-9702
> URL: https://issues.apache.org/jira/browse/ARROW-9702
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the unpack32 function has some statically compiled AVX512 SIMD code; it 
> should be reworked to use the runtime path. It can also be implemented with AVX2.
>  
> The unpack32 API is used by PlainDecodingBoolean.
>  





[jira] [Commented] (ARROW-9643) [C++] Illegal instruction on haswell cpu

2020-08-12 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176259#comment-17176259
 ] 

Krisztian Szucs commented on ARROW-9643:


Should not be included in 1.0.1 since 
https://issues.apache.org/jira/browse/ARROW-9398 was not part of the 1.0.0 
release.

> [C++] Illegal instruction on haswell cpu
> 
>
> Key: ARROW-9643
> URL: https://issues.apache.org/jira/browse/ARROW-9643
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Frank Du
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
> Attachments: CMakeCache.txt, avx512-workaround.diff
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Some unit tests failed with *"Illegal instruction"* on Intel E5-2650 
> (haswell).
> {noformat}
> $ cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=ON 
> -DARROW_COMPUTE=ON ..
> $ ninja unittest
> ..
> The following tests FAILED:
>11 - arrow-stl-test (Failed)
>14 - arrow-diff-test (Failed)
>22 - arrow-compute-internals-test (Failed)
>23 - arrow-compute-scalar-test (Failed)
>24 - arrow-compute-vector-test (Failed)
>25 - arrow-compute-aggregate-test (Failed)
> $ release/arrow-stl-test
> ..
> Illegal instruction
> $ lscpu
> ..
> Model name:  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
> {noformat}
> Using git bisect, I found that this PR causes the error: 
> [https://github.com/apache/arrow/pull/7700]
> [~frankdu], would you double check it? thanks.





[jira] [Updated] (ARROW-9643) [C++] Illegal instruction on haswell cpu

2020-08-12 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9643:
---
Fix Version/s: (was: 1.0.1)

> [C++] Illegal instruction on haswell cpu
> 
>
> Key: ARROW-9643
> URL: https://issues.apache.org/jira/browse/ARROW-9643
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Frank Du
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: CMakeCache.txt, avx512-workaround.diff
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Some unit tests failed with *"Illegal instruction"* on Intel E5-2650 
> (haswell).
> {noformat}
> $ cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=ON 
> -DARROW_COMPUTE=ON ..
> $ ninja unittest
> ..
> The following tests FAILED:
>11 - arrow-stl-test (Failed)
>14 - arrow-diff-test (Failed)
>22 - arrow-compute-internals-test (Failed)
>23 - arrow-compute-scalar-test (Failed)
>24 - arrow-compute-vector-test (Failed)
>25 - arrow-compute-aggregate-test (Failed)
> $ release/arrow-stl-test
> ..
> Illegal instruction
> $ lscpu
> ..
> Model name:  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
> {noformat}
> Using git bisect, I found that this PR causes the error: 
> [https://github.com/apache/arrow/pull/7700]
> [~frankdu], would you double check it? thanks.





[jira] [Updated] (ARROW-9597) [C++] AddAlias in compute::FunctionRegistry should be synchronized

2020-08-12 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9597:
---
Fix Version/s: (was: 1.0.1)
   2.0.0

> [C++] AddAlias in compute::FunctionRegistry should be synchronized
> --
>
> Key: ARROW-9597
> URL: https://issues.apache.org/jira/browse/ARROW-9597
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9402) [C++] Add portable wrappers for __builtin_add_overflow and friends

2020-08-12 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9402:
---
Fix Version/s: (was: 1.0.1)

> [C++] Add portable wrappers for __builtin_add_overflow and friends
> --
>
> Key: ARROW-9402
> URL: https://issues.apache.org/jira/browse/ARROW-9402
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9631) [Rust] Arrow crate should not depend on flight

2020-08-12 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9631:
---
Fix Version/s: 1.0.1

> [Rust] Arrow crate should not depend on flight
> --
>
> Key: ARROW-9631
> URL: https://issues.apache.org/jira/browse/ARROW-9631
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Mahmut Bulut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> It seems that the dependencies are inverted. The core arrow crate should 
> contain the array data structures and compute kernels and should not depend 
> on the flight crate, which contains protocols and brings in many dependencies.
> If we have code for converting between arrow types and flight types then that 
> code should live in the flight crate.





[jira] [Updated] (ARROW-9631) [Rust] Arrow crate should not depend on flight

2020-08-12 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9631:
---
Issue Type: Bug  (was: Improvement)

> [Rust] Arrow crate should not depend on flight
> --
>
> Key: ARROW-9631
> URL: https://issues.apache.org/jira/browse/ARROW-9631
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Mahmut Bulut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> It seems that the dependencies are inverted. The core arrow crate should 
> contain the array data structures and compute kernels and should not depend 
> on the flight crate, which contains protocols and brings in many dependencies.
> If we have code for converting between arrow types and flight types then that 
> code should live in the flight crate.





[jira] [Updated] (ARROW-9592) [CI] Update homebrew before calling brew bundle

2020-08-12 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9592:
---
Issue Type: Bug  (was: Improvement)

> [CI] Update homebrew before calling brew bundle
> ---
>
> Key: ARROW-9592
> URL: https://issues.apache.org/jira/browse/ARROW-9592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The macOS GHA builds have started to fail recently. We need to update brew 
> itself before installing the dependencies.





[jira] [Updated] (ARROW-9592) [CI] Update homebrew before calling brew bundle

2020-08-12 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9592:
---
Fix Version/s: 1.0.1

> [CI] Update homebrew before calling brew bundle
> ---
>
> Key: ARROW-9592
> URL: https://issues.apache.org/jira/browse/ARROW-9592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The macOS GHA builds have started to fail recently. We need to update brew 
> itself before installing the dependencies.





[jira] [Created] (ARROW-9703) [Developer][Archery] Restartable cherry-picking process for creating maintenance branches

2020-08-12 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9703:
--

 Summary: [Developer][Archery] Restartable cherry-picking process 
for creating maintenance branches
 Key: ARROW-9703
 URL: https://issues.apache.org/jira/browse/ARROW-9703
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Archery, Developer Tools
Reporter: Krisztian Szucs
 Fix For: 2.0.0


Archery already had some features to generate the cherry-picking commands, but 
conflicting patches can make the manual application/reapplication procedure 
complicated.

1. Add an archery command to recreate the maintenance branch based on a jira 
release.
2. Extend the above command with an option to continue the cherry-picking 
process after a conflict resolution.
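
The two steps above can be sketched as follows. This is only an illustration of the restartable idea, not Archery's implementation: the function name `cherry_pick_all`, the JSON state file, and the `apply` callback (which stands in for running `git cherry-pick` on one commit) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def cherry_pick_all(commits, apply, state_path, resume=False):
    """Apply commits in order, saving progress so a later call can resume.

    `apply(commit)` returns True on success and False on a conflict.
    """
    state = {"done": [], "pending": None}
    if resume and state_path.exists():
        state = json.loads(state_path.read_text())
        if state["pending"] is not None:
            # Assumption: the user resolved the conflict by hand and
            # committed the result before asking to continue.
            state["done"].append(state["pending"])
            state["pending"] = None
    for commit in commits:
        if commit in state["done"]:
            continue  # already on the maintenance branch
        if not apply(commit):
            # Record where we stopped so the process can be restarted.
            state["pending"] = commit
            state_path.write_text(json.dumps(state))
            return False
        state["done"].append(commit)
    state_path.write_text(json.dumps(state))
    return True


# Demonstration with a fake `apply` that conflicts on commit "c2".
applied = []


def fake_apply(commit):
    if commit == "c2":
        return False
    applied.append(commit)
    return True


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "cherry-pick-state.json"
    first_run = cherry_pick_all(["c1", "c2", "c3"], fake_apply, state_file)
    applied_after_first = list(applied)
    second_run = cherry_pick_all(["c1", "c2", "c3"], fake_apply, state_file,
                                 resume=True)
```

The first call stops at the conflicting commit; the second call, made after a manual resolution, skips everything already handled and applies only the remaining commit.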





[jira] [Updated] (ARROW-9703) [Developer][Archery] Restartable cherry-picking process for creating maintenance branches

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9703:
--
Labels: pull-request-available  (was: )

> [Developer][Archery] Restartable cherry-picking process for creating 
> maintenance branches
> -
>
> Key: ARROW-9703
> URL: https://issues.apache.org/jira/browse/ARROW-9703
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Archery, Developer Tools
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Archery already had some features to generate the cherry-picking commands, 
> but conflicting patches can make the manual application/reapplication 
> procedure complicated.
> 1. Add an archery command to recreate the maintenance branch based on a jira 
> release.
> 2. Extend the above command with an option to continue the cherry-picking 
> process after a conflict resolution.





[jira] [Resolved] (ARROW-9577) [Python][C++] posix_madvise error on Debian in pyarrow 1.0.0

2020-08-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9577.
---
Resolution: Fixed

Issue resolved by pull request 7904
[https://github.com/apache/arrow/pull/7904]

> [Python][C++] posix_madvise error on Debian in pyarrow 1.0.0
> 
>
> Key: ARROW-9577
> URL: https://issues.apache.org/jira/browse/ARROW-9577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
> Environment: Installed with Miniconda (for Debian; used pip for the 
> Ubuntu test)
>Reporter: Jim Pivarski
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0, 1.0.1
>
> Attachments: location-of-pxi-files.log, strace-parquet-read.log, 
> stuff.parquet
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The following writes and reads back from a Parquet file in both pyarrow 
> 0.17.0 and 1.0.0 on Ubuntu 18.04:
>  
> {code:java}
> >>> import pyarrow.parquet
> >>> a = pyarrow.array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
> >>> t = pyarrow.Table.from_batches([pyarrow.RecordBatch.from_arrays([a], ["stuff"])])
> >>> pyarrow.parquet.write_table(t, "stuff.parquet")
> >>> t2 = pyarrow.parquet.read_table("stuff.parquet") {code}
>  
> However, the same thing raises the following exception on Debian 9 (stretch) 
> in pyarrow 1.0.0 but not in pyarrow 0.17.0:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1564, in read_table
> filters=filters,
>   File 
> "/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1433, in __init__
> partitioning=partitioning)
>   File 
> "/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/dataset.py", 
> line 667, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File 
> "/home/jpivarski/miniconda3/lib/python3.7/site-packages/pyarrow/dataset.py", 
> line 434, in _filesystem_dataset
> return factory.finish(schema)
>   File "pyarrow/_dataset.pyx", line 1451, in 
> pyarrow._dataset.DatasetFactory.finish
>   File "pyarrow/error.pxi", line 122, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> OSError: posix_madvise failed. Detail: [errno 0] Success{code}
> It's a little odd that the error says that it failed with "detail: success". 
> That suggests to me that an "if" predicate is backward (missing "not"), which 
> might only be triggered on some OS/distributions.
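
A note on the "failed ... Success" message: POSIX specifies that posix_madvise() returns the error number directly instead of setting errno. One possible explanation (an assumption here, not something this thread confirms) is that the failure was detected from the return value but described using errno, which was still 0. The sketch below uses a fake call and hypothetical helper names to show the difference between the two ways of reporting.

```python
import errno
import os


def fake_posix_madvise(fail):
    # Hypothetical stand-in for a C call that returns the error number
    # directly and leaves the global errno untouched.
    return errno.EINVAL if fail else 0


def describe_using_errno(ret, current_errno=0):
    # Misleading variant: detects failure from the return value but
    # reports the (unset) errno, so the detail refers to error number 0.
    if ret != 0:
        return "posix_madvise failed. Detail: [errno %d] %s" % (
            current_errno, os.strerror(current_errno))
    return None


def describe_using_return_value(ret):
    # Variant that reports the returned error number itself.
    if ret != 0:
        return "posix_madvise failed. Detail: [errno %d] %s" % (
            ret, os.strerror(ret))
    return None


ret = fake_posix_madvise(fail=True)
misleading = describe_using_errno(ret)
accurate = describe_using_return_value(ret)
ok = describe_using_return_value(fake_posix_madvise(fail=False))

print(misleading)
print(accurate)
```

The text produced by `os.strerror` is platform dependent, so only the error numbers in the two messages are meaningful here.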





[jira] [Resolved] (ARROW-9684) [C++] Fix undefined behaviour on invalid IPC / Parquet input (OSS-Fuzz)

2020-08-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9684.
---
Resolution: Fixed

Issue resolved by pull request 7927
[https://github.com/apache/arrow/pull/7927]

> [C++] Fix undefined behaviour on invalid IPC / Parquet input (OSS-Fuzz)
> ---
>
> Key: ARROW-9684
> URL: https://issues.apache.org/jira/browse/ARROW-9684
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0, 1.0.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9598) [C++][Parquet] Spaced definition levels is not assigned correctly.

2020-08-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9598:
--
Fix Version/s: 2.0.0

> [C++][Parquet]  Spaced definition levels is not assigned correctly.
> ---
>
> Key: ARROW-9598
> URL: https://issues.apache.org/jira/browse/ARROW-9598
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The existing code assumes that there is only a single repeated parent.  The code 
> needs to backtrack until null or a repeated parent.  Unfortunately, without 
> a read path that can read mixed struct/repeated values, we can't 
> fully test the fix.





[jira] [Resolved] (ARROW-9598) [C++][Parquet] Spaced definition levels is not assigned correctly.

2020-08-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9598.
---
Resolution: Fixed

Fixed in PR 

> [C++][Parquet]  Spaced definition levels is not assigned correctly.
> ---
>
> Key: ARROW-9598
> URL: https://issues.apache.org/jira/browse/ARROW-9598
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The existing code assumes that there is only a single repeated parent.  The code 
> needs to backtrack until null or a repeated parent.  Unfortunately, without 
> a read path that can read mixed struct/repeated values, we can't 
> fully test the fix.





[jira] [Comment Edited] (ARROW-9598) [C++][Parquet] Spaced definition levels is not assigned correctly.

2020-08-12 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176333#comment-17176333
 ] 

Antoine Pitrou edited comment on ARROW-9598 at 8/12/20, 1:21 PM:
-

Fixed in PR https://github.com/apache/arrow/pull/7862


was (Author: pitrou):
Fixed in PR 

> [C++][Parquet]  Spaced definition levels is not assigned correctly.
> ---
>
> Key: ARROW-9598
> URL: https://issues.apache.org/jira/browse/ARROW-9598
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The existing code assumes that there is only a single repeated parent.  The code 
> needs to backtrack until null or a repeated parent.  Unfortunately, without 
> a read path that can read mixed struct/repeated values, we can't 
> fully test the fix.





[jira] [Commented] (ARROW-9602) [R] Improve cmake detection in Linux build

2020-08-12 Thread Matt Pollock (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176336#comment-17176336
 ] 

Matt Pollock commented on ARROW-9602:
-

I tested this using {{install_arrow(binary = FALSE, use_system = FALSE, 
nightly=TRUE)}} and it worked great. Thanks!

> [R] Improve cmake detection in Linux build
> --
>
> Key: ARROW-9602
> URL: https://issues.apache.org/jira/browse/ARROW-9602
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.0
>Reporter: Matt Pollock
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
>  
> {code:java}
> > arrow::write_parquet(iris, "~/iris") 
> *** caught segfault ***
> address (nil), cause 'memory not mapped'
> Traceback:
>  1: Table__from_dots(dots, schema)
>  2: shared_ptr_is_null(xp)
>  3: shared_ptr(Table, Table__from_dots(dots, schema))
>  4: Table$create(x)
>  5: arrow::write_parquet(iris, "~/iris")
> {code}
> The segfault is easy to generate trying to write iris data to parquet. I have 
> tried R 4.0.0 and R 4.0.2, I've installed the arrow (R) package from CRAN, 
> source, nightly build, both with and without using the system arrow C++ 
> installation. When using system arrow the installed version is:
> {noformat}
> Installed Packages 
> Name        : arrow-devel 
> Arch        : x86_64 
> Version     : 1.0.0 
> Release     : 1.el7 
> Size        : 32 M 
> Repo        : installed 
> From repo   : apache-arrow 
> Summary     : Libraries and header files for Apache Arrow C++ 
> URL         : https://arrow.apache.org/ 
> License     : Apache-2.0 
> Description : Libraries and header files for Apache Arrow C++.
> {noformat}
>  I realize that this is so basic that it seems improbable that your CI didn't 
> catch something (i.e., that the issue has to do with my local environment) 
> but would appreciate verification that version 1.0 works for others on centOS7





[jira] [Assigned] (ARROW-9693) [CI][Docs] Nightly docs build fails

2020-08-12 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-9693:
--

Assignee: Krisztian Szucs

> [CI][Docs] Nightly docs build fails
> ---
>
> Key: ARROW-9693
> URL: https://issues.apache.org/jira/browse/ARROW-9693
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Documentation, Python
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=15998&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181
> {code}
> ...
> reading sources... [ 13%] format/Integration
> reading sources... [ 13%] format/Layout
> reading sources... [ 13%] format/Metadata
> reading sources... [ 13%] format/Other
> reading sources... [ 14%] format/Versioning
> reading sources... [ 14%] index
> reading sources... [ 14%] java/index
> reading sources... [ 14%] java/ipc
> reading sources... [ 14%] java/vector
> reading sources... [ 15%] java/vector_schema_root
> reading sources... [ 15%] python/api
> reading sources... [ 15%] python/api/arrays
> /arrow/docs/source/cpp/api/flight.rst:204: WARNING: doxygenfunction: Unable 
> to resolve multiple matches for function "arrow::flight::MakeFlightError" 
> with arguments () in doxygen xml output for project "arrow_cpp" from 
> directory: ../../cpp/apidoc/xml.
> Potential matches:
> - Status MakeFlightError(FlightStatusCode code, const std::string 
> &message)
> - Status MakeFlightError(FlightStatusCode code, const std::string 
> &message, const std::string &extra_info)
> Extension error:
> Handler  for event 
> 'autodoc-process-docstring' threw an exception (exception:  array> is not a module, class, method, function, traceback, frame, or code 
> object)
> Error: `docker-compose --file /home/vsts/work/1/s/arrow/docker-compose.yml 
> run --rm -e SETUPTOOLS_SCM_PRETEND_VERSION=1.1.0.dev75 ubuntu-docs` exited 
> with a non-zero exit code 2, see the process log above.
> {code}
> cc [~lidavidm] [~kszucs]





[jira] [Updated] (ARROW-9693) [CI][Docs] Nightly docs build fails

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9693:
--
Labels: pull-request-available  (was: )

> [CI][Docs] Nightly docs build fails
> ---
>
> Key: ARROW-9693
> URL: https://issues.apache.org/jira/browse/ARROW-9693
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Documentation, Python
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=15998&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181
> {code}
> ...
> reading sources... [ 13%] format/Integration
> reading sources... [ 13%] format/Layout
> reading sources... [ 13%] format/Metadata
> reading sources... [ 13%] format/Other
> reading sources... [ 14%] format/Versioning
> reading sources... [ 14%] index
> reading sources... [ 14%] java/index
> reading sources... [ 14%] java/ipc
> reading sources... [ 14%] java/vector
> reading sources... [ 15%] java/vector_schema_root
> reading sources... [ 15%] python/api
> reading sources... [ 15%] python/api/arrays
> /arrow/docs/source/cpp/api/flight.rst:204: WARNING: doxygenfunction: Unable 
> to resolve multiple matches for function "arrow::flight::MakeFlightError" 
> with arguments () in doxygen xml output for project "arrow_cpp" from 
> directory: ../../cpp/apidoc/xml.
> Potential matches:
> - Status MakeFlightError(FlightStatusCode code, const std::string 
> &message)
> - Status MakeFlightError(FlightStatusCode code, const std::string 
> &message, const std::string &extra_info)
> Extension error:
> Handler  for event 
> 'autodoc-process-docstring' threw an exception (exception:  array> is not a module, class, method, function, traceback, frame, or code 
> object)
> Error: `docker-compose --file /home/vsts/work/1/s/arrow/docker-compose.yml 
> run --rm -e SETUPTOOLS_SCM_PRETEND_VERSION=1.1.0.dev75 ubuntu-docs` exited 
> with a non-zero exit code 2, see the process log above.
> {code}
> cc [~lidavidm] [~kszucs]





[jira] [Assigned] (ARROW-9528) [Python] Honor tzinfo information when converting from datetime to pyarrow

2020-08-12 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-9528:
--

Assignee: Krisztian Szucs

> [Python] Honor tzinfo information when converting from datetime to pyarrow
> --
>
> Key: ARROW-9528
> URL: https://issues.apache.org/jira/browse/ARROW-9528
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Micah Kornfield
>Assignee: Krisztian Szucs
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 12h
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-9704) [Java] TestEndianness.testLittleEndian fails on big endian platform

2020-08-12 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created ARROW-9704:
---

 Summary: [Java] TestEndianness.testLittleEndian fails on big 
endian platform
 Key: ARROW-9704
 URL: https://issues.apache.org/jira/browse/ARROW-9704
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Kazuaki Ishizaki
Assignee: Kazuaki Ishizaki
 Fix For: 2.0.0


{{TestEndianness.testLittleEndian}} assumes that the data layout of int is 
little-endian. Thus, this test fails on a big-endian platform.





[jira] [Created] (ARROW-9705) [C++] Validate that intraday time is zeroed out in Date64 data

2020-08-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9705:
---

 Summary: [C++] Validate that intraday time is zeroed out in Date64 
data
 Key: ARROW-9705
 URL: https://issues.apache.org/jira/browse/ARROW-9705
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 2.0.0








[jira] [Created] (ARROW-9706) [Java] Tests in TestLargeListVector fails on big endian platform

2020-08-12 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created ARROW-9706:
---

 Summary: [Java] Tests in TestLargeListVector fails on big endian 
platform
 Key: ARROW-9706
 URL: https://issues.apache.org/jira/browse/ARROW-9706
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 2.0.0
Reporter: Kazuaki Ishizaki
Assignee: Kazuaki Ishizaki


Multiple test cases (e.g. {{testSetLastSetUsage}}) in {{TestLargeListVector}} 
fail on a big-endian platform.

This is because these test cases read each offset as a 4-byte integer, while 
the offset is 8 bytes wide. As a result, only the first 4 bytes are read, which 
yields the correct value only on a little-endian platform.
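
The failure mode described above can be illustrated with a minimal sketch. It is written in Rust rather than Java and is not the actual test code; the helper names `first_four_le`, `first_four_be` and `full_offset_be` are hypothetical. The first two helpers interpret only the first four bytes of an 8-byte offset, which happens to recover a small value from a little-endian layout but not from a big-endian one.

```rust
/// Interprets only the first four bytes of an 8-byte offset that was
/// stored in little-endian order. Hypothetical helper for illustration.
fn first_four_le(bytes: [u8; 8]) -> i32 {
    i32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]])
}

/// Interprets only the first four bytes of an 8-byte offset that was
/// stored in big-endian order. For small offsets these are the high-order
/// (zero) bytes, so the value is lost.
fn first_four_be(bytes: [u8; 8]) -> i32 {
    i32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]])
}

/// Reads all eight bytes of a big-endian offset, which preserves the value.
fn full_offset_be(bytes: [u8; 8]) -> i64 {
    i64::from_be_bytes(bytes)
}
```

With an offset of 5, `first_four_le(5i64.to_le_bytes())` returns 5, while `first_four_be(5i64.to_be_bytes())` returns 0. Reading the full 8-byte width avoids the dependence on byte order.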





[jira] [Resolved] (ARROW-9659) [C++] RecordBatchStreamReader throws on CUDA device buffers

2020-08-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9659.
---
Resolution: Fixed

Issue resolved by pull request 7909
[https://github.com/apache/arrow/pull/7909]

> [C++] RecordBatchStreamReader throws on CUDA device buffers
> ---
>
> Key: ARROW-9659
> URL: https://issues.apache.org/jira/browse/ARROW-9659
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Paul Taylor
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: cuda, pull-request-available
> Fix For: 2.0.0, 1.0.1
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Prior to 1.0.0, the RecordBatchStreamReader was capable of reading source 
> CudaBuffers wrapped in a CudaBufferReader. In 1.0.0, the Array validation 
> routines call into Buffer::data(), which throws an error if the source isn't 
> in host memory.





[jira] [Assigned] (ARROW-9659) [C++] RecordBatchStreamReader throws on CUDA device buffers

2020-08-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9659:
-

Assignee: Paul Taylor  (was: Antoine Pitrou)

> [C++] RecordBatchStreamReader throws on CUDA device buffers
> ---
>
> Key: ARROW-9659
> URL: https://issues.apache.org/jira/browse/ARROW-9659
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: cuda, pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Prior to 1.0.0, the RecordBatchStreamReader was capable of reading source 
> CudaBuffers wrapped in a CudaBufferReader. In 1.0.0, the Array validation 
> routines call into Buffer::data(), which throws an error if the source isn't 
> in host memory.





[jira] [Updated] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

2020-08-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9464:
--
Summary: [Rust] [DataFusion] Physical plan refactor to support optimization 
rules and more efficient use of threads  (was: [Rust] [DataFusion] Physical 
plan refactor to support async and optimization rules)

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and 
> more efficient use of threads
> --
>
> Key: ARROW-9464
> URL: https://issues.apache.org/jira/browse/ARROW-9464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I would like to propose a refactor of the physical/execution planning based 
> on the experience I have had in implementing distributed execution in 
> Ballista.
> This will likely need subtasks but here is an overview of the changes I am 
> proposing.
> h3. *Introduce enum to represent physical plan.*
> By wrapping the execution plan structs in an enum, we make it possible to 
> build a tree representing the physical plan just like we do with the logical 
> plan. This makes it easy to print physical plans and also to apply 
> transformations to it.
> {code:java}
>  pub enum PhysicalPlan {
> /// Projection.
> Projection(Arc),
> /// Filter a.k.a predicate.
> Filter(Arc),
> /// Hash aggregate
> HashAggregate(Arc),
> /// Performs a hash join of two child relations by first shuffling the 
> data using the join keys.
> ShuffledHashJoin(ShuffledHashJoinExec),
> /// Performs a shuffle that will result in the desired partitioning.
> ShuffleExchange(Arc),
> /// Reads results from a ShuffleExchange
> ShuffleReader(Arc),
> /// Scans a partitioned data source
> ParquetScan(Arc),
> /// Scans an in-memory table
> InMemoryTableScan(Arc),
> }{code}
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify 
> its input and output partitioning needs, and then have an optimization rule 
> that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design 
> is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning(&self) -> Partitioning {
> Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this 
> operator
> fn required_child_distribution(&self) -> Distribution {
> Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering(&self) -> Option> {
> None
> }
> /// Specifies the ordering requirements of all the children for this operator
> fn required_child_ordering(&self) -> Option>> {
> None
> }
>  {code}
> A good example of applying this rule would be in the case of hash aggregates 
> where we perform a partial aggregate in parallel across partitions and then 
> coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required 
> for its children.
>  
>  





[jira] [Updated] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support async and optimization rules

2020-08-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9464:
--
Description: 
I would like to propose a refactor of the physical/execution planning based on 
the experience I have had in implementing distributed execution in Ballista.

This will likely need subtasks but here is an overview of the changes I am 
proposing.
h3. *Introduce enum to represent physical plan.*

By wrapping the execution plan structs in an enum, we make it possible to build 
a tree representing the physical plan just like we do with the logical plan. 
This makes it easy to print physical plans and also to apply transformations to 
it.
{code:java}
 pub enum PhysicalPlan {
/// Projection.
Projection(Arc),
/// Filter a.k.a predicate.
Filter(Arc),
/// Hash aggregate
HashAggregate(Arc),
/// Performs a hash join of two child relations by first shuffling the data 
using the join keys.
ShuffledHashJoin(ShuffledHashJoinExec),
/// Performs a shuffle that will result in the desired partitioning.
ShuffleExchange(Arc),
/// Reads results from a ShuffleExchange
ShuffleReader(Arc),
/// Scans a partitioned data source
ParquetScan(Arc),
/// Scans an in-memory table
InMemoryTableScan(Arc),
}{code}
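
As a rough illustration of why an enum makes the plan easy to print, here is a minimal sketch. The `Plan` type, its fields and `example_plan` are hypothetical simplifications and not the proposed `PhysicalPlan` definition: children are held in a `Box` and operators carry plain strings instead of execution structs.

```rust
use std::fmt;

/// Simplified, hypothetical stand-in for a physical plan enum.
enum Plan {
    ParquetScan { path: String },
    Filter { predicate: String, input: Box<Plan> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

impl Plan {
    /// Writes this node and then its child, indenting two spaces per level.
    fn fmt_indented(&self, f: &mut fmt::Formatter<'_>, depth: usize) -> fmt::Result {
        let pad = "  ".repeat(depth);
        match self {
            Plan::ParquetScan { path } => writeln!(f, "{}ParquetScan: {}", pad, path),
            Plan::Filter { predicate, input } => {
                writeln!(f, "{}Filter: {}", pad, predicate)?;
                input.fmt_indented(f, depth + 1)
            }
            Plan::Projection { columns, input } => {
                writeln!(f, "{}Projection: {}", pad, columns.join(", "))?;
                input.fmt_indented(f, depth + 1)
            }
        }
    }
}

impl fmt::Display for Plan {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.fmt_indented(f, 0)
    }
}

/// Builds a small three-level plan: projection over filter over scan.
fn example_plan() -> Plan {
    Plan::Projection {
        columns: vec!["a".to_string(), "b".to_string()],
        input: Box::new(Plan::Filter {
            predicate: "a > 1".to_string(),
            input: Box::new(Plan::ParquetScan {
                path: "data.parquet".to_string(),
            }),
        }),
    }
}
```

Because every operator is a variant of one type, a single recursive `match` is enough to walk the whole tree. The same shape would apply to a transformation that returns a rewritten `Plan`.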
h3. *Introduce physical plan optimization rule to insert "shuffle" operators*

We should extend the ExecutionPlan trait so that each operator can specify its 
input and output partitioning needs, and then have an optimization rule that 
can insert any repartitioning or reordering steps required.

For example, these are the methods to be added to ExecutionPlan. This design is 
based on Apache Spark.

 
{code:java}
/// Specifies how data is partitioned across different nodes in the cluster
fn output_partitioning(&self) -> Partitioning {
Partitioning::UnknownPartitioning(0)
}

/// Specifies the data distribution requirements of all the children for this 
operator
fn required_child_distribution(&self) -> Distribution {
Distribution::UnspecifiedDistribution
}

/// Specifies how data is ordered in each partition
fn output_ordering(&self) -> Option> {
None
}

/// Specifies the ordering requirements of all the children for this operator
fn required_child_ordering(&self) -> Option>> {
None
}
 {code}
A good example of applying this rule would be in the case of hash aggregates 
where we perform a partial aggregate in parallel across partitions and then 
coalesce the results and apply a final hash aggregate.

Another example would be a SortMergeExec specifying the sort order required for 
its children.
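
A rule of this kind could look roughly like the sketch below. All types here (`Plan`, `Partitioning`, `Distribution`) and the functions `enforce` and `insert_shuffles` are hypothetical, much reduced versions of what the proposal describes. The sketch also assumes the final aggregate asks for hash-clustered input, which is only one possible way to bring partial results together.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    Unknown,
    Hash(Vec<String>),
}

#[derive(Debug, Clone, PartialEq)]
enum Distribution {
    Unspecified,
    HashClustered(Vec<String>),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan,
    PartialAggregate { keys: Vec<String>, input: Box<Plan> },
    FinalAggregate { keys: Vec<String>, input: Box<Plan> },
    ShuffleExchange { keys: Vec<String>, input: Box<Plan> },
}

impl Plan {
    /// How the output of this operator is partitioned.
    fn output_partitioning(&self) -> Partitioning {
        match self {
            Plan::Scan => Partitioning::Unknown,
            Plan::PartialAggregate { input, .. } | Plan::FinalAggregate { input, .. } => {
                input.output_partitioning()
            }
            Plan::ShuffleExchange { keys, .. } => Partitioning::Hash(keys.clone()),
        }
    }

    /// What this operator requires from its child.
    fn required_child_distribution(&self) -> Distribution {
        match self {
            Plan::FinalAggregate { keys, .. } => Distribution::HashClustered(keys.clone()),
            _ => Distribution::Unspecified,
        }
    }
}

/// Optimizes the child, then wraps it in a shuffle if it does not already
/// satisfy the parent's requirement.
fn enforce(child: Plan, required: &Distribution) -> Plan {
    let child = insert_shuffles(child);
    match required {
        Distribution::HashClustered(keys)
            if child.output_partitioning() != Partitioning::Hash(keys.clone()) =>
        {
            Plan::ShuffleExchange { keys: keys.clone(), input: Box::new(child) }
        }
        _ => child,
    }
}

/// The rule itself: rebuild the tree bottom-up, inserting shuffles where needed.
fn insert_shuffles(plan: Plan) -> Plan {
    let required = plan.required_child_distribution();
    match plan {
        Plan::Scan => Plan::Scan,
        Plan::PartialAggregate { keys, input } => {
            Plan::PartialAggregate { keys, input: Box::new(enforce(*input, &required)) }
        }
        Plan::FinalAggregate { keys, input } => {
            Plan::FinalAggregate { keys, input: Box::new(enforce(*input, &required)) }
        }
        Plan::ShuffleExchange { keys, input } => {
            Plan::ShuffleExchange { keys, input: Box::new(enforce(*input, &required)) }
        }
    }
}
```

Applied to a final aggregate placed directly over a partial aggregate, the rule inserts one `ShuffleExchange` between them. Running it again leaves the plan unchanged, because the shuffle already reports the required partitioning.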

 

 

  was:
I would like to propose a refactor of the physical/execution planning based on 
the experience I have had in implementing distributed execution in Ballista.

This will likely need subtasks but here is an overview of the changes I am 
proposing.
h3. *Introduce enum to represent physical plan.*

By wrapping the execution plan structs in an enum, we make it possible to build 
a tree representing the physical plan just like we do with the logical plan. 
This makes it easy to print physical plans and also to apply transformations to 
it.
{code:java}
 pub enum PhysicalPlan {
/// Projection.
Projection(Arc),
/// Filter a.k.a predicate.
Filter(Arc),
/// Hash aggregate
HashAggregate(Arc),
/// Performs a hash join of two child relations by first shuffling the data 
using the join keys.
ShuffledHashJoin(ShuffledHashJoinExec),
/// Performs a shuffle that will result in the desired partitioning.
ShuffleExchange(Arc),
/// Reads results from a ShuffleExchange
ShuffleReader(Arc),
/// Scans a partitioned data source
ParquetScan(Arc),
/// Scans an in-memory table
InMemoryTableScan(Arc),
}{code}
h3. *Introduce physical plan optimization rule to insert "shuffle" operators*

We should extend the ExecutionPlan trait so that each operator can specify its 
input and output partitioning needs, and then have an optimization rule that 
can insert any repartitioning or reordering steps required.

For example, these are the methods to be added to ExecutionPlan. This design is 
based on Apache Spark.

 
{code:java}
/// Specifies how data is partitioned across different nodes in the cluster
fn output_partitioning(&self) -> Partitioning {
Partitioning::UnknownPartitioning(0)
}

/// Specifies the data distribution requirements of all the children for this 
operator
fn required_child_distribution(&self) -> Distribution {
Distribution::UnspecifiedDistribution
}

/// Specifies how data is ordered in each partition
fn output_ordering(&self) -> Option> {
None
}

/// Specifies the ordering requirements of all the children for this operator
fn required_child_ordering(&self) -> Option>> {
None
}
 {code}
A good example of applying this rule would be in the case of hash aggregates 
where we perform

[jira] [Closed] (ARROW-9480) [Rust] [DataFusion] All DataFusion execution plan traits should require Send + Sync

2020-08-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-9480.
-
Resolution: Won't Fix

Based on recent experience testing query execution with async, I no longer feel 
that this makes sense. Async is good for network io but not for file io. It is 
better to have a dedicated thread per partition when executing queries. 

Also, we can't use async with Parquet currently without launching a dedicated 
thread per partition which pretty much defeats the point of using async in the 
first place.

> [Rust] [DataFusion] All DataFusion execution plan traits should require Send 
> + Sync
> ---
>
> Key: ARROW-9480
> URL: https://issues.apache.org/jira/browse/ARROW-9480
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> All DataFusion execution plan traits should require Send + Sync, to prepare 
> for async support.





[jira] [Comment Edited] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

2020-08-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176414#comment-17176414
 ] 

Andy Grove edited comment on ARROW-9464 at 8/12/20, 3:19 PM:
-

Based on recent experience testing query execution with async, I no longer feel 
that async makes sense for DataFusion. Async is good for network io but not for 
file io. It is better to have a single thread per partition when executing 
queries.

Also, we can't use async with Parquet currently without launching a dedicated 
thread per partition which pretty much defeats the point of using async in the 
first place.

I believe that we do need the concept of executors and a scheduler in 
DataFusion, where each executor would run on a dedicated thread. Other projects 
would then be able to extend this for distributed execution for example.


was (Author: andygrove):
Based on recent experience testing query execution with async, I no longer feel 
that async makes sense for DataFusion. Async is good for network io but not for 
file io. It is better to have a dedicated thread per partition when executing 
queries. 

Also, we can't use async with Parquet currently without launching a dedicated 
thread per partition which pretty much defeats the point of using async in the 
first place.

I believe that we do need the concept of executors and a scheduler in 
DataFusion, where each executor would run on a dedicated thread. Other projects 
would then be able to extend this for distributed execution for example.

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and 
> more efficient use of threads
> --
>
> Key: ARROW-9464
> URL: https://issues.apache.org/jira/browse/ARROW-9464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I would like to propose a refactor of the physical/execution planning based 
> on the experience I have had in implementing distributed execution in 
> Ballista.
> This will likely need subtasks but here is an overview of the changes I am 
> proposing.
> h3. *Introduce enum to represent physical plan.*
> By wrapping the execution plan structs in an enum, we make it possible to 
> build a tree representing the physical plan just like we do with the logical 
> plan. This makes it easy to print physical plans and also to apply 
> transformations to it.
> {code:java}
>  pub enum PhysicalPlan {
> /// Projection.
> Projection(Arc),
> /// Filter a.k.a predicate.
> Filter(Arc),
> /// Hash aggregate
> HashAggregate(Arc),
> /// Performs a hash join of two child relations by first shuffling the 
> data using the join keys.
> ShuffledHashJoin(ShuffledHashJoinExec),
> /// Performs a shuffle that will result in the desired partitioning.
> ShuffleExchange(Arc),
> /// Reads results from a ShuffleExchange
> ShuffleReader(Arc),
> /// Scans a partitioned data source
> ParquetScan(Arc),
> /// Scans an in-memory table
> InMemoryTableScan(Arc),
> }{code}
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify 
> its input and output partitioning needs, and then have an optimization rule 
> that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design 
> is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning(&self) -> Partitioning {
> Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this 
> operator
> fn required_child_distribution(&self) -> Distribution {
> Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering(&self) -> Option> {
> None
> }
> /// Specifies the ordering requirements of all the children for this operator
> fn required_child_ordering(&self) -> Option>> {
> None
> }
>  {code}
> A good example of applying this rule would be in the case of hash aggregates 
> where we perform a partial aggregate in parallel across partitions and then 
> coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required 
> for its children.
>  
>  





[jira] [Commented] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

2020-08-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176414#comment-17176414
 ] 

Andy Grove commented on ARROW-9464:
---

Based on recent experience testing query execution with async, I no longer feel 
that async makes sense for DataFusion. Async is good for network io but not for 
file io. It is better to have a dedicated thread per partition when executing 
queries. 

Also, we can't use async with Parquet currently without launching a dedicated 
thread per partition which pretty much defeats the point of using async in the 
first place.

I believe that we do need the concept of executors and a scheduler in 
DataFusion, where each executor would run on a dedicated thread. Other projects 
would then be able to extend this for distributed execution for example.

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and 
> more efficient use of threads
> --
>
> Key: ARROW-9464
> URL: https://issues.apache.org/jira/browse/ARROW-9464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I would like to propose a refactor of the physical/execution planning based 
> on the experience I have had in implementing distributed execution in 
> Ballista.
> This will likely need subtasks but here is an overview of the changes I am 
> proposing.
> h3. *Introduce enum to represent physical plan.*
> By wrapping the execution plan structs in an enum, we make it possible to 
> build a tree representing the physical plan just like we do with the logical 
> plan. This makes it easy to print physical plans and also to apply 
> transformations to it.
> {code:java}
>  pub enum PhysicalPlan {
> /// Projection.
> Projection(Arc),
> /// Filter a.k.a predicate.
> Filter(Arc),
> /// Hash aggregate
> HashAggregate(Arc),
> /// Performs a hash join of two child relations by first shuffling the 
> data using the join keys.
> ShuffledHashJoin(ShuffledHashJoinExec),
> /// Performs a shuffle that will result in the desired partitioning.
> ShuffleExchange(Arc),
> /// Reads results from a ShuffleExchange
> ShuffleReader(Arc),
> /// Scans a partitioned data source
> ParquetScan(Arc),
> /// Scans an in-memory table
> InMemoryTableScan(Arc),
> }{code}
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify 
> its input and output partitioning needs, and then have an optimization rule 
> that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design 
> is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning(&self) -> Partitioning {
> Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this 
> operator
> fn required_child_distribution(&self) -> Distribution {
> Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering(&self) -> Option> {
> None
> }
> /// Specifies the ordering requirements of all the children for this operator
> fn required_child_ordering(&self) -> Option>> {
> None
> }
>  {code}
> A good example of applying this rule would be in the case of hash aggregates 
> where we perform a partial aggregate in parallel across partitions and then 
> coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required 
> for its children.
>  
>  





[jira] [Created] (ARROW-9707) [Rust] [DataFusion] Re-implement threading model

2020-08-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-9707:
-

 Summary: [Rust] [DataFusion] Re-implement threading model
 Key: ARROW-9707
 URL: https://issues.apache.org/jira/browse/ARROW-9707
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


The current threading model is very simple and does not scale. We currently use 
1-2 dedicated threads per partition and they all run simultaneously, which is a 
huge problem if you have more partitions than logical or physical cores.

This task is to re-implement the threading model so that query execution uses a 
fixed (configurable) number of threads. Work will be broken down into stages 
and tasks and each in-process executor (running on a dedicated thread) will 
process its queue of tasks.

This process will be driven by a scheduler.
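
To make the idea concrete, here is a minimal sketch of running one task per partition on a fixed number of threads. It is an assumption-laden illustration and not the planned design: the function `run_partitions` is hypothetical, all workers share a single queue rather than each executor owning its own, there is no separate scheduler or stage concept, and the "work" is a placeholder computation.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Processes `num_partitions` tasks on `num_threads` worker threads and
/// returns `(partition, result)` pairs sorted by partition index.
fn run_partitions(num_threads: usize, num_partitions: usize) -> Vec<(usize, u64)> {
    let (task_tx, task_rx) = mpsc::channel::<usize>();
    // The receiving end is shared so that any idle worker can take the next task.
    let task_rx = Arc::new(Mutex::new(task_rx));
    let (result_tx, result_rx) = mpsc::channel::<(usize, u64)>();

    let workers: Vec<_> = (0..num_threads)
        .map(|_| {
            let task_rx = Arc::clone(&task_rx);
            let result_tx = result_tx.clone();
            thread::spawn(move || loop {
                // The lock is released at the end of this statement.
                let next = task_rx.lock().unwrap().recv();
                match next {
                    Ok(partition) => {
                        // Placeholder for executing the query on one partition.
                        let value = (partition as u64) * 2;
                        result_tx.send((partition, value)).unwrap();
                    }
                    // The queue is closed and drained, so the worker exits.
                    Err(_) => break,
                }
            })
        })
        .collect();

    // Enqueue one task per partition, then close the queue.
    for partition in 0..num_partitions {
        task_tx.send(partition).unwrap();
    }
    drop(task_tx);
    drop(result_tx);

    for worker in workers {
        worker.join().unwrap();
    }
    let mut results: Vec<(usize, u64)> = result_rx.iter().collect();
    results.sort();
    results
}
```

Calling `run_partitions(2, 5)` processes five partitions with only two threads alive at any time, so the thread count no longer grows with the number of partitions.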





[jira] [Created] (ARROW-9708) [Rust] [DataFusion] Remove use of threads and channels from ParquetScanExec

2020-08-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-9708:
-

 Summary: [Rust] [DataFusion] Remove use of threads and channels 
from ParquetScanExec
 Key: ARROW-9708
 URL: https://issues.apache.org/jira/browse/ARROW-9708
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


At the time I implemented the parquet scan exec I thought it was necessary to 
run in a thread and use channels to communicate but this can be avoided in 
conjunction with other changes in the parent issue.





[jira] [Commented] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

2020-08-12 Thread Adam Lippai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176423#comment-17176423
 ] 

Adam Lippai commented on ARROW-9464:


I think using sync file io is a good compromise, since neither Arrow nor 
DataFusion performs low-latency or highly concurrent file io, at least not yet.

Does "It is better to have a single thread per partition when executing 
queries." contradict "we do need the concept of executors and a scheduler in 
DataFusion"?
What do you think about my initial concern regarding the number of max threads?
Does limiting the concurrency or using a threadpool make sense?

If I have a partitioned dataset (let's say 1000 or 10k files), each with 1000 
columns I should be able to read and process it without spawning this amount of 
threads _at once_.

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and 
> more efficient use of threads
> --
>
> Key: ARROW-9464
> URL: https://issues.apache.org/jira/browse/ARROW-9464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I would like to propose a refactor of the physical/execution planning based 
> on the experience I have had in implementing distributed execution in 
> Ballista.
> This will likely need subtasks but here is an overview of the changes I am 
> proposing.
> h3. *Introduce enum to represent physical plan.*
> By wrapping the execution plan structs in an enum, we make it possible to 
> build a tree representing the physical plan just like we do with the logical 
> plan. This makes it easy to print physical plans and also to apply 
> transformations to it.
> {code:java}
>  pub enum PhysicalPlan {
> /// Projection.
> Projection(Arc),
> /// Filter a.k.a predicate.
> Filter(Arc),
> /// Hash aggregate
> HashAggregate(Arc),
> /// Performs a hash join of two child relations by first shuffling the 
> data using the join keys.
> ShuffledHashJoin(ShuffledHashJoinExec),
> /// Performs a shuffle that will result in the desired partitioning.
> ShuffleExchange(Arc),
> /// Reads results from a ShuffleExchange
> ShuffleReader(Arc),
> /// Scans a partitioned data source
> ParquetScan(Arc),
> /// Scans an in-memory table
> InMemoryTableScan(Arc),
> }{code}
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify 
> its input and output partitioning needs, and then have an optimization rule 
> that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design 
> is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning(&self) -> Partitioning {
> Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this 
> operator
> fn required_child_distribution(&self) -> Distribution {
> Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering(&self) -> Option> {
> None
> }
> /// Specifies the ordering requirements of all the children for this operator
> fn required_child_ordering(&self) -> Option>> {
> None
> }
>  {code}
> A good example of applying this rule would be in the case of hash aggregates 
> where we perform a partial aggregate in parallel across partitions and then 
> coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required 
> for its children.
>  
>  





[jira] [Updated] (ARROW-2303) [C++] Disable ASAN when building io-hdfs-test.cc

2020-08-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2303:
--
Priority: Minor  (was: Major)

> [C++] Disable ASAN when building io-hdfs-test.cc
> 
>
> Key: ARROW-2303
> URL: https://issues.apache.org/jira/browse/ARROW-2303
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
> Fix For: 2.0.0
>
>
> ASAN reports spurious memory leaks in this unit test module. I am not sure of 
> the easiest way to conditionally scrub the ASAN flags from such a unit test's 
> compilation flags.





[jira] [Created] (ARROW-9709) [Java] Test cases in arrow-vector assume little-endian platform

2020-08-12 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created ARROW-9709:
---

 Summary: [Java] Test cases in arrow-vector assume little-endian 
platform
 Key: ARROW-9709
 URL: https://issues.apache.org/jira/browse/ARROW-9709
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Kazuaki Ishizaki
Assignee: Kazuaki Ishizaki


{{MessageSerializerTest.testWriteMessageBufferAligned}}, 
{{TestArrowReaderWriter.testChannelReadFully}} and 
{{TestArrowReaderWriter.testChannelReadFullyEos}} assume only a little-endian 
platform.

Two tests in {{TestArrowReaderWriter}} fail on a big-endian platform.





[jira] [Updated] (ARROW-9704) [Java] TestEndianness.testLittleEndian fails on big endian platform

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9704:
--
Labels: pull-request-available  (was: )

> [Java] TestEndianness.testLittleEndian fails on big endian platform
> ---
>
> Key: ARROW-9704
> URL: https://issues.apache.org/jira/browse/ARROW-9704
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{TestEndianness.testLittleEndian}} assumes that the data layout of int is 
> little-endian. Thus, this test fails on a big-endian platform.
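To illustrate why a test that hard-codes the byte layout of an int fails on a big-endian machine, here is a minimal sketch in Python rather than Java. The helper name `int_bytes` is made up for this example and is not part of Arrow; only the standard `struct` and `sys` modules are used.

```python
import struct
import sys


def int_bytes(value, byteorder):
    """Return the 4-byte layout of a signed 32-bit int for the given byte order."""
    fmt = "<i" if byteorder == "little" else ">i"
    return struct.pack(fmt, value)


# The same value has two different in-memory layouts.
little = int_bytes(1, "little")  # least significant byte first
big = int_bytes(1, "big")        # most significant byte first

# "=i" packs with the byte order of the machine running this script,
# so a test comparing against a fixed little-endian layout would only
# pass when sys.byteorder == "little".
native = struct.pack("=i", 1)
print(sys.byteorder, little.hex(), big.hex(), native.hex())
```

A portable test would either state the byte order explicitly, as `int_bytes` does, or branch on the platform byte order when building the expected bytes.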





[jira] [Assigned] (ARROW-9704) [Java] TestEndianness.testLittleEndian fails on big endian platform

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9704:


Assignee: Kazuaki Ishizaki  (was: Apache Arrow JIRA Bot)

> [Java] TestEndianness.testLittleEndian fails on big endian platform
> ---
>
> Key: ARROW-9704
> URL: https://issues.apache.org/jira/browse/ARROW-9704
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{TestEndianness.testLittleEndian}} assumes that the data layout of int is 
> little-endian. Thus, this test fails on a big-endian platform.





[jira] [Assigned] (ARROW-9704) [Java] TestEndianness.testLittleEndian fails on big endian platform

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9704:


Assignee: Apache Arrow JIRA Bot  (was: Kazuaki Ishizaki)

> [Java] TestEndianness.testLittleEndian fails on big endian platform
> ---
>
> Key: ARROW-9704
> URL: https://issues.apache.org/jira/browse/ARROW-9704
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{TestEndianness.testLittleEndian}} assumes that the data layout of int is 
> little-endian. Thus, this test fails on a big-endian platform.





[jira] [Updated] (ARROW-9706) [Java] Tests in TestLargeListVector fails on big endian platform

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9706:
--
Labels: pull-request-available  (was: )

> [Java] Tests in TestLargeListVector fails on big endian platform
> 
>
> Key: ARROW-9706
> URL: https://issues.apache.org/jira/browse/ARROW-9706
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Multiple test cases (e.g. {{testSetLastSetUsage}}) in {{TestLargeListVector}} 
> fail on a big-endian platform.
> This is because these test cases read the offset as a 4-byte integer although 
> the offset is 8 bytes wide. This means that only the first 4 bytes are read, 
> which works only on a little-endian platform.
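The effect described above can be reproduced outside of Java. The following is a small Python sketch, assuming a buffer that holds a single 8-byte offset; the function names are invented for this example and do not come from the Arrow test suite.

```python
import struct


def read_first_four_bytes(offset, byteorder):
    """Write an 8-byte offset, then (incorrectly) read only its first 4 bytes as an int."""
    fmt64 = "<q" if byteorder == "little" else ">q"
    fmt32 = "<i" if byteorder == "little" else ">i"
    buf = struct.pack(fmt64, offset)
    return struct.unpack(fmt32, buf[:4])[0]


def read_all_eight_bytes(offset, byteorder):
    """Write an 8-byte offset and read it back at its full width."""
    fmt64 = "<q" if byteorder == "little" else ">q"
    buf = struct.pack(fmt64, offset)
    return struct.unpack(fmt64, buf)[0]


# On a little-endian layout the low-order bytes come first, so the
# truncated read happens to return the right value for small offsets.
print(read_first_four_bytes(5, "little"))
# On a big-endian layout the first 4 bytes are the high-order half.
print(read_first_four_bytes(5, "big"))
# Reading the full width is correct for either byte order.
print(read_all_eight_bytes(5, "big"))
```

Reading the offset at its full 8-byte width removes the dependence on byte order.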





[jira] [Assigned] (ARROW-9706) [Java] Tests in TestLargeListVector fails on big endian platform

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9706:


Assignee: Apache Arrow JIRA Bot  (was: Kazuaki Ishizaki)

> [Java] Tests in TestLargeListVector fails on big endian platform
> 
>
> Key: ARROW-9706
> URL: https://issues.apache.org/jira/browse/ARROW-9706
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Multiple test cases (e.g. {{testSetLastSetUsage}}) in {{TestLargeListVector}} 
> fail on a big-endian platform.
> This is because these test cases read the offset as a 4-byte integer although 
> the offset is 8 bytes wide. This means that only the first 4 bytes are read, 
> which works only on a little-endian platform.





[jira] [Assigned] (ARROW-9706) [Java] Tests in TestLargeListVector fails on big endian platform

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9706:


Assignee: Kazuaki Ishizaki  (was: Apache Arrow JIRA Bot)

> [Java] Tests in TestLargeListVector fails on big endian platform
> 
>
> Key: ARROW-9706
> URL: https://issues.apache.org/jira/browse/ARROW-9706
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Multiple test cases (e.g. {{testSetLastSetUsage}}) in {{TestLargeListVector}} 
> fail on a big-endian platform.
> This is because these test cases read the offset as a 4-byte integer although 
> the offset is 8 bytes wide. This means that only the first 4 bytes are read, 
> which works only on a little-endian platform.





[jira] [Commented] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

2020-08-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176442#comment-17176442
 ] 

Andy Grove commented on ARROW-9464:
---

I did a slightly better job of explaining this in 
https://issues.apache.org/jira/browse/ARROW-9707

"The current threading model is very simple and does not scale. We currently 
use 1-2 dedicated threads per partition and they all run simultaneously, which 
is a huge problem if you have more partitions than logical or physical cores.
This task is to re-implement the threading model so that query execution uses a 
fixed (configurable) number of threads. Work will be broken down into stages 
and tasks and each in-process executor (running on a dedicated thread) will 
process its queue of tasks.

This process will be driven by a scheduler."
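As a rough illustration of the idea quoted above (a fixed number of threads working through queued tasks instead of one or two threads per partition), here is a minimal Python sketch. It is not DataFusion code: `run_tasks` is a hypothetical helper, and it simplifies the proposal by using one shared queue rather than a scheduler feeding a queue per executor.

```python
import queue
import threading


def run_tasks(tasks, num_threads):
    """Run callables on a fixed number of worker threads; results keep task order."""
    work = queue.Queue()
    for index, task in enumerate(tasks):
        work.put((index, task))
    results = [None] * len(tasks)

    def worker():
        # Each worker keeps pulling tasks until the queue is drained.
        while True:
            try:
                index, task = work.get_nowait()
            except queue.Empty:
                return
            results[index] = task()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# 16 "partitions" are processed, but only 4 threads ever exist.
partition_tasks = [lambda p=p: sum(range(p * 10, p * 10 + 10)) for p in range(16)]
partition_sums = run_tasks(partition_tasks, num_threads=4)
print(partition_sums)
```

The number of threads stays constant regardless of how many partitions there are, which is the property the comment asks for.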

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and 
> more efficient use of threads
> --
>
> Key: ARROW-9464
> URL: https://issues.apache.org/jira/browse/ARROW-9464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I would like to propose a refactor of the physical/execution planning based 
> on the experience I have had in implementing distributed execution in 
> Ballista.
> This will likely need subtasks but here is an overview of the changes I am 
> proposing.
> h3. *Introduce enum to represent physical plan.*
> By wrapping the execution plan structs in an enum, we make it possible to 
> build a tree representing the physical plan just like we do with the logical 
> plan. This makes it easy to print physical plans and also to apply 
> transformations to it.
> {code:java}
>  pub enum PhysicalPlan {
> /// Projection.
> Projection(Arc),
> /// Filter a.k.a predicate.
> Filter(Arc),
> /// Hash aggregate
> HashAggregate(Arc),
> /// Performs a hash join of two child relations by first shuffling the data using the join keys.
> ShuffledHashJoin(ShuffledHashJoinExec),
> /// Performs a shuffle that will result in the desired partitioning.
> ShuffleExchange(Arc),
> /// Reads results from a ShuffleExchange
> ShuffleReader(Arc),
> /// Scans a partitioned data source
> ParquetScan(Arc),
> /// Scans an in-memory table
> InMemoryTableScan(Arc),
> }{code}
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify 
> its input and output partitioning needs, and then have an optimization rule 
> that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design 
> is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning(&self) -> Partitioning {
> Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this operator
> fn required_child_distribution(&self) -> Distribution {
> Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering(&self) -> Option> {
> None
> }
> /// Specifies the ordering requirements of all the children for this operator
> fn required_child_ordering(&self) -> Option>> {
> None
> }
>  {code}
> A good example of applying this rule would be in the case of hash aggregates 
> where we perform a partial aggregate in parallel across partitions and then 
> coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required 
> for its children.
>  
>  
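The optimization rule described in the quoted proposal can be sketched as a tree rewrite. The following Python sketch is only an illustration under stated assumptions: the `Plan` class, the string-valued partitioning labels, and the `insert_shuffles` function are all hypothetical, and `ShuffleExchange` is borrowed from the enum above as a stand-in for whatever repartitioning or coalescing operator the real rule would insert.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Plan:
    """Hypothetical physical plan node; partitioning is modelled as a plain label."""
    name: str
    children: List["Plan"] = field(default_factory=list)
    output_partitioning: str = "unknown"
    required_child_distribution: Optional[str] = None


def insert_shuffles(plan):
    """Return a copy of the plan with an exchange wherever a child does not meet its parent's requirement."""
    new_children = []
    for child in plan.children:
        child = insert_shuffles(child)
        required = plan.required_child_distribution
        if required is not None and child.output_partitioning != required:
            child = Plan("ShuffleExchange", [child], output_partitioning=required)
        new_children.append(child)
    return Plan(plan.name, new_children, plan.output_partitioning,
                plan.required_child_distribution)


def describe(plan):
    """Render the plan tree as a nested string."""
    if not plan.children:
        return plan.name
    return f"{plan.name}({', '.join(describe(c) for c in plan.children)})"


# Partial aggregate per partition, then a final aggregate over a single partition.
scan = Plan("ParquetScan")
partial = Plan("HashAggregate[partial]", [scan])
final = Plan("HashAggregate[final]", [partial], output_partitioning="single",
             required_child_distribution="single")
optimized = insert_shuffles(final)
print(describe(optimized))
```

The same traversal could check ordering requirements in addition to distribution, which would cover the SortMergeExec case mentioned in the proposal.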





[jira] [Updated] (ARROW-9709) [Java] Test cases in arrow-vector assume little-endian platform

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9709:
--
Labels: pull-request-available  (was: )

> [Java] Test cases in arrow-vector assume little-endian platform
> ---
>
> Key: ARROW-9709
> URL: https://issues.apache.org/jira/browse/ARROW-9709
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{MessageSerializerTest.testWriteMessageBufferAligned}}, 
> {{TestArrowReaderWriter.testChannelReadFully}} and 
> {{TestArrowReaderWriter.testChannelReadFullyEos}} assume only a little-endian 
> platform.
> Two tests in {{TestArrowReaderWriter}} fail on a big-endian platform.





[jira] [Assigned] (ARROW-9709) [Java] Test cases in arrow-vector assume little-endian platform

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9709:


Assignee: Apache Arrow JIRA Bot  (was: Kazuaki Ishizaki)

> [Java] Test cases in arrow-vector assume little-endian platform
> ---
>
> Key: ARROW-9709
> URL: https://issues.apache.org/jira/browse/ARROW-9709
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{MessageSerializerTest.testWriteMessageBufferAligned}}, 
> {{TestArrowReaderWriter.testChannelReadFully}} and 
> {{TestArrowReaderWriter.testChannelReadFullyEos}} assume only a little-endian 
> platform.
> Two tests in {{TestArrowReaderWriter}} fail on a big-endian platform.





[jira] [Assigned] (ARROW-9709) [Java] Test cases in arrow-vector assume little-endian platform

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9709:


Assignee: Kazuaki Ishizaki  (was: Apache Arrow JIRA Bot)

> [Java] Test cases in arrow-vector assume little-endian platform
> ---
>
> Key: ARROW-9709
> URL: https://issues.apache.org/jira/browse/ARROW-9709
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{MessageSerializerTest.testWriteMessageBufferAligned}}, 
> {{TestArrowReaderWriter.testChannelReadFully}} and 
> {{TestArrowReaderWriter.testChannelReadFullyEos}} assume only a little-endian 
> platform.
> Two tests in {{TestArrowReaderWriter}} fail on a big-endian platform.





[jira] [Commented] (ARROW-9707) [Rust] [DataFusion] Re-implement threading model

2020-08-12 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176481#comment-17176481
 ] 

Wes McKinney commented on ARROW-9707:
-

I'll be curious what approach you take to prevent IO steps from blocking CPU 
work; we still haven't sorted out how we're dealing with that broadly in C++.

> [Rust] [DataFusion] Re-implement threading model
> 
>
> Key: ARROW-9707
> URL: https://issues.apache.org/jira/browse/ARROW-9707
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> The current threading model is very simple and does not scale. We currently 
> use 1-2 dedicated threads per partition and they all run simultaneously, 
> which is a huge problem if you have more partitions than logical or physical 
> cores.
> This task is to re-implement the threading model so that query execution uses 
> a fixed (configurable) number of threads. Work will be broken down into 
> stages and tasks and each in-process executor (running on a dedicated thread) 
> will process its queue of tasks.
> This process will be driven by a scheduler.





[jira] [Commented] (ARROW-9707) [Rust] [DataFusion] Re-implement threading model

2020-08-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176493#comment-17176493
 ] 

Andy Grove commented on ARROW-9707:
---

One approach I accidentally stumbled on was to have a dedicated thread
reading from disk and then have other operators running with async and
using channels to send batches from the disk reader to the downstream
operators. This is roughly how things are implemented today in DataFusion
(but not leveraging async).

The advantage of this approach is that the reader thread is at least
running in parallel to downstream operators processing previous batches.
The downside of course is having a dedicated thread per operator that reads
from disk.

I would be interested in making the Parquet crate async so that we can test
async end to end (even though I've been told that async is not good for
file io) but unfortunately the work to do that is non-trivial.
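The reader-thread approach described above can be sketched in a few lines. This is a Python illustration, not DataFusion or Parquet code: the function names are hypothetical, an in-memory list stands in for the file, and a bounded `queue.Queue` plays the role of the channel.

```python
import queue
import threading


def read_batches(source, channel):
    """Dedicated reader: push each batch into the channel, then signal the end."""
    for batch in source:
        channel.put(batch)  # blocks while the channel is full
    channel.put(None)       # sentinel meaning "no more batches"


def process(channel):
    """Downstream operator: consume batches while the reader keeps reading."""
    totals = []
    while True:
        batch = channel.get()
        if batch is None:
            break
        totals.append(sum(batch))
    return totals


batches = [[1, 2, 3], [4, 5], [6]]   # stand-in for record batches on disk
channel = queue.Queue(maxsize=2)     # bounded, so the reader cannot run far ahead
reader = threading.Thread(target=read_batches, args=(batches, channel))
reader.start()
totals = process(channel)
reader.join()
print(totals)
```

The bounded channel limits how many batches are buffered at once, while the downside noted above remains: every operator that reads from disk needs its own thread.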



> [Rust] [DataFusion] Re-implement threading model
> 
>
> Key: ARROW-9707
> URL: https://issues.apache.org/jira/browse/ARROW-9707
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> The current threading model is very simple and does not scale. We currently 
> use 1-2 dedicated threads per partition and they all run simultaneously, 
> which is a huge problem if you have more partitions than logical or physical 
> cores.
> This task is to re-implement the threading model so that query execution uses 
> a fixed (configurable) number of threads. Work will be broken down into 
> stages and tasks and each in-process executor (running on a dedicated thread) 
> will process its queue of tasks.
> This process will be driven by a scheduler.





[jira] [Closed] (ARROW-9668) [C++]Got AVX512 method compiled in, but AVX512 instructions is not supported

2020-08-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-9668.
---

> [C++]Got AVX512 method compiled in, but AVX512 instructions is not supported
> 
>
> Key: ARROW-9668
> URL: https://issues.apache.org/jira/browse/ARROW-9668
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Centos, c++ (GCC) 7.4.0
>Reporter: Dongxiao Song
>Priority: Major
>
> When using garrow_numeric_array_compare() to compare Arrow arrays, 
> arrow::compute::aggregate::AddSumAvx512AggKernels() was called.
> But my computer doesn't support the AVX512 instruction set, so it crashed and 
> reported:
> Program terminated with signal 4, Illegal instruction.
> #0  0x7efc082fe1d7 in 
> arrow::compute::aggregate::AddSumAvx512AggKernels(arrow::compute::ScalarAggregateFunction*)
>  () from /usr/local/lib64/libarrow.so.200
>  
> Is this a bug?
> I found that the CXX_SUPPORTS_AVX512 flag is decided by whether the compiler can 
> compile with
> -march=skylake-avx512 -mbmi2. In my opinion, these options just tell the 
> compiler to try to compile with AVX512; if it is not supported, the compiler 
> doesn't complain.





[jira] [Commented] (ARROW-7843) [Ruby] MSYS2 packages needed for Gandiva

2020-08-12 Thread Dominic Sisneros (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176514#comment-17176514
 ] 

Dominic Sisneros commented on ARROW-7843:
-

Thank you - I pushed to get it uploaded to the pacman repository on the gitter 
channel since it was taking so long. https://gitter.im/msys2/msys2.  Thanks for 
all your work

> [Ruby] MSYS2 packages needed for Gandiva
> 
>
> Key: ARROW-7843
> URL: https://issues.apache.org/jira/browse/ARROW-7843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 0.16.0
> Environment: windows with rubyinstaller
>Reporter: Dominic Sisneros
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 1.0.0
>
>
> {noformat}
> require "gandiva"
> table = Arrow::Table.new(:field1 => Arrow::Int32Array.new([1, 2, 3, 4]),
>  :field2 => Arrow::Int32Array.new([11, 13, 15, 17]))
> schema = table.schema
> expression1 = schema.build_expression do |record|
>   record.field1 + record.field2
> end
> expression2 = schema.build_expression do |record, context|
>   context.if(record.field1 > record.field2)
> .then(record.field1 / record.field2)
> .else(record.field1)
> end
> projector = Gandiva::Projector.new(schema, [expression1, expression2])
> table.each_record_batch do |record_batch|
>   outputs = projector.evaluate(record_batch)
>   puts outputs.collect(&:values)
> end
> C:\Users\Dominic E Sisneros\source\repos\ruby\try_arrow>ruby gandiva_test2.rb
> Traceback (most recent call last):
> 2: from gandiva_test2.rb:1:in `'
> 1: from 
> c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in 
> `require'
> c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in 
> `require': cannot load such file -- gandiva (LoadError)
> 9: from gandiva_test2.rb:1:in `'
> 8: from 
> c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:156:in 
> `require'
> 7: from 
> c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:168:in 
> `rescue in require'
> 6: from 
> c:/Ruby27-x64/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:168:in 
> `require'
> 5: from 
> c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/red-gandiva-0.16.0/lib/gandiva.rb:24:in
>  `'
> 4: from 
> c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/red-gandiva-0.16.0/lib/gandiva.rb:28:in
>  `'
> 3: from 
> c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/red-gandiva-0.16.0/lib/gandiva/loader.rb:22:in
>  `load'
> 2: from 
> c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.4.1/lib/gobject-introspection/loader.rb:25:in
>  `load'
> 1: from 
> c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.4.1/lib/gobject-introspection/loader.rb:37:in
>  `load'
> c:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.4.1/lib/gobject-introspection/loader.rb:37:in
>  `require': Typelib file for namespace 'Gandiva' (any version) not found 
> (GObjectIntrospection::RepositoryError::TypelibNotFound)
> {noformat}





[jira] [Commented] (ARROW-9707) [Rust] [DataFusion] Re-implement threading model

2020-08-12 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176522#comment-17176522
 ] 

Andrew Lamb commented on ARROW-9707:


[~andygrove] I am also very interested in this change (it is something we have 
been studying and thinking about with [~pauldix]). What you have outlined (a 
fixed and configurable number of threads) is exactly our use case. I would 
enjoy collaborating with you if you have any need.

> [Rust] [DataFusion] Re-implement threading model
> 
>
> Key: ARROW-9707
> URL: https://issues.apache.org/jira/browse/ARROW-9707
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> The current threading model is very simple and does not scale. We currently 
> use 1-2 dedicated threads per partition and they all run simultaneously, 
> which is a huge problem if you have more partitions than logical or physical 
> cores.
> This task is to re-implement the threading model so that query execution uses 
> a fixed (configurable) number of threads. Work will be broken down into 
> stages and tasks and each in-process executor (running on a dedicated thread) 
> will process its queue of tasks.
> This process will be driven by a scheduler.





[jira] [Commented] (ARROW-9694) [Ruby] can't install red-arrow-gsl

2020-08-12 Thread Dominic Sisneros (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176527#comment-17176527
 ] 

Dominic Sisneros commented on ARROW-9694:
-

I submitted a patch to rb-gsl.  I was able to gem install gsl.  Tried to install 
red-arrow-gsl and got the following gem_make.out

https://gist.github.com/872dbbfd07fb7bc7709364e3949eac33

> [Ruby] can't install red-arrow-gsl
> --
>
> Key: ARROW-9694
> URL: https://issues.apache.org/jira/browse/ARROW-9694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 1.0.0
> Environment: windows, msys2, 
>Reporter: Dominic Sisneros
>Priority: Major
> Attachments: gem_make.out
>
>
> {noformat}
> f:\programming\source\repos\ruby\try_arrow>gem install red-arrow-gsl
> Temporarily enhancing PATH for MSYS/MINGW...
> Building native extensions. This could take a while...
> ERROR:  Error installing red-arrow-gsl:
> ERROR: Failed to build gem native extension.
> current directory: 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3/ext/gsl_native
> F:/windows/scoop/apps/ruby/2.7.1-1/bin/ruby.exe -I 
> F:/windows/scoop/apps/ruby/2.7.1-1/lib/ruby/site_ruby/2.7.0 -r 
> ./siteconf20200811-28480-149f31i.rb extconf.rb
> sh: gsl-config: No such file or directory
> *** ERROR: missing required library to compile this module: undefined method 
> `chomp' for nil:NilClass
> *** extconf.rb failed ***
> Could not create Makefile due to some reason, probably lack of necessary
> libraries and/or headers.  Check the mkmf.log file for more details.  You may
> need configuration options.
> Provided configuration options:
> --with-opt-dir
> --without-opt-dir
> --with-opt-include
> --without-opt-include=${opt-dir}/include
> --with-opt-lib
> --without-opt-lib=${opt-dir}/lib
> --with-make-prog
> --without-make-prog
> --srcdir=.
> --curdir
> --ruby=F:/windows/scoop/apps/ruby/2.7.1-1/bin/$(RUBY_BASE_NAME)
> --with-gsl-version
> extconf failed, exit code 1
> Gem files will remain installed in 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3 for inspection.
> Results logged to 
> F:/windows/scoop/persist/ruby/gems/extensions/x64-mingw32/2.7.0/gsl-2.1.0.3/gem_make.out
> {noformat}





[jira] [Resolved] (ARROW-9604) [C++] Add benchmark for aggregate min/max compute kernels

2020-08-12 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9604.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7870
[https://github.com/apache/arrow/pull/7870]

> [C++] Add benchmark for aggregate min/max compute kernels
> -
>
> Key: ARROW-9604
> URL: https://issues.apache.org/jira/browse/ARROW-9604
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Add benchmark for aggregate min/max compute kernels, similar to sum aggregate.





[jira] [Commented] (ARROW-9707) [Rust] [DataFusion] Re-implement threading model

2020-08-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176552#comment-17176552
 ] 

Andy Grove commented on ARROW-9707:
---

[~alamb] That would be great. I think [~jorgecarleitao] may also be interested. 
I have already prototyped this out but would welcome a design review before 
making the changes in DataFusion. I will start a Google doc for us to discuss 
this and will post a link here soon.

> [Rust] [DataFusion] Re-implement threading model
> 
>
> Key: ARROW-9707
> URL: https://issues.apache.org/jira/browse/ARROW-9707
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> The current threading model is very simple and does not scale. We currently 
> use 1-2 dedicated threads per partition and they all run simultaneously, 
> which is a huge problem if you have more partitions than logical or physical 
> cores.
> This task is to re-implement the threading model so that query execution uses 
> a fixed (configurable) number of threads. Work will be broken down into 
> stages and tasks and each in-process executor (running on a dedicated thread) 
> will process its queue of tasks.
> This process will be driven by a scheduler.





[jira] [Created] (ARROW-9710) [C++] Generalize Decimal ToString in preparation for Decimal256

2020-08-12 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9710:
--

 Summary: [C++] Generalize Decimal ToString in preparation for 
Decimal256
 Key: ARROW-9710
 URL: https://issues.apache.org/jira/browse/ARROW-9710
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield
Assignee: Mingyu Zhong


Generalize Decimal ToString method in preparation for introducing Decimal256 
bit type (and other bit widths as needed).  
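One way to picture a width-independent ToString is a formatter that works on the unscaled integer and the scale alone. The sketch below is in Python, whose integers have arbitrary precision, so the same function covers 128-bit and 256-bit values. `decimal_to_string` is a hypothetical name, and the output format (plain notation, non-negative scale only) is an assumption for illustration, not necessarily the format Arrow's C++ implementation produces.

```python
def decimal_to_string(unscaled, scale):
    """Format unscaled * 10**-scale in plain decimal notation (scale must be >= 0)."""
    if scale < 0:
        raise ValueError("this sketch only handles non-negative scales")
    sign = "-" if unscaled < 0 else ""
    digits = str(abs(unscaled))
    if scale == 0:
        return sign + digits
    # Left-pad so there is at least one digit before the decimal point.
    digits = digits.rjust(scale + 1, "0")
    return sign + digits[:-scale] + "." + digits[-scale:]


print(decimal_to_string(12345, 2))
print(decimal_to_string(-5, 3))
# A value far beyond 128 bits is handled by the same code path.
print(decimal_to_string(2**200, 10))
```

Because nothing in the function depends on the storage width, only the conversion from the fixed-width representation to an integer would differ between decimal types.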

 

 





[jira] [Resolved] (ARROW-9679) [Rust] [DataFusion] HashAggregate walks map many times building final batch

2020-08-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9679.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7936
[https://github.com/apache/arrow/pull/7936]

> [Rust] [DataFusion] HashAggregate walks map many times building final batch
> ---
>
> Key: ARROW-9679
> URL: https://issues.apache.org/jira/browse/ARROW-9679
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The current HashAggregate implementation iterates over the final hash map 
> once for each grouping expression and once for each aggregate expression. 
> This is inefficient and possibly dangerous depending on the ordering 
> guarantees made by the hash map implementation.
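The single-pass alternative can be sketched as follows. This is a Python illustration with hypothetical names, not the DataFusion implementation: `groups` maps a tuple of group-by values to a list of finished aggregate values, and every output column is filled during one walk over the map.

```python
def build_columns(groups, num_keys, num_aggs):
    """Build all key and aggregate columns in a single pass over the hash map."""
    key_columns = [[] for _ in range(num_keys)]
    agg_columns = [[] for _ in range(num_aggs)]
    for key, accumulators in groups.items():  # the only walk over the map
        for i in range(num_keys):
            key_columns[i].append(key[i])
        for j in range(num_aggs):
            agg_columns[j].append(accumulators[j])
    return key_columns + agg_columns


# Two grouping expressions and two aggregate expressions.
groups = {("a", 1): [10, 2], ("b", 2): [7, 1]}
columns = build_columns(groups, num_keys=2, num_aggs=2)
print(columns)
```

Since each entry contributes to every column in the same iteration, the rows stay aligned whatever order the map yields its entries in. Python dicts happen to iterate in insertion order, so the ordering hazard mentioned above would not show up in Python itself.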





[jira] [Resolved] (ARROW-9695) [Rust][DataFusion] Improve documentation on LogicalPlan variants

2020-08-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9695.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7934
[https://github.com/apache/arrow/pull/7934]

> [Rust][DataFusion] Improve documentation on LogicalPlan variants
> 
>
> Key: ARROW-9695
> URL: https://issues.apache.org/jira/browse/ARROW-9695
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I think we could improve the documentation somewhat on LogicalPlan nodes. I 
> will submit a PR with a proposal. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9710) [C++] Generalize Decimal ToString in preparation for Decimal256

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9710:
--
Labels: pull-request-available  (was: )

> [C++] Generalize Decimal ToString in preparation for Decimal256
> ---
>
> Key: ARROW-9710
> URL: https://issues.apache.org/jira/browse/ARROW-9710
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Mingyu Zhong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Generalize Decimal ToString method in preparation for introducing Decimal256 
> bit type (and other bit widths as needed).  
>  
>  
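
One way to picture a width-independent ToString is to format from the unscaled integer and the scale alone, so that nothing depends on whether the value is stored in 128 or 256 bits. The sketch below is in Python (whose integers are arbitrary precision) and is only an illustration: the function name is hypothetical, it handles non-negative scales only, and it does not attempt to reproduce the exact output rules of Arrow's C++ implementation.

```python
def decimal_to_string(unscaled: int, scale: int) -> str:
    """Format unscaled * 10**(-scale) as a plain decimal string (hypothetical helper)."""
    if scale < 0:
        raise ValueError("this sketch only handles scale >= 0")

    sign = "-" if unscaled < 0 else ""
    digits = str(abs(unscaled))
    if scale == 0:
        return sign + digits

    # Left-pad with zeros so at least one digit precedes the decimal point.
    digits = digits.rjust(scale + 1, "0")
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"
```

The same routine serves any bit width because the only inputs are the integer value and the scale.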





[jira] [Assigned] (ARROW-9710) [C++] Generalize Decimal ToString in preparation for Decimal256

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9710:


Assignee: Apache Arrow JIRA Bot  (was: Mingyu Zhong)

> [C++] Generalize Decimal ToString in preparation for Decimal256
> ---
>
> Key: ARROW-9710
> URL: https://issues.apache.org/jira/browse/ARROW-9710
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Generalize Decimal ToString method in preparation for introducing Decimal256 
> bit type (and other bit widths as needed).  
>  
>  





[jira] [Assigned] (ARROW-9710) [C++] Generalize Decimal ToString in preparation for Decimal256

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9710:


Assignee: Mingyu Zhong  (was: Apache Arrow JIRA Bot)

> [C++] Generalize Decimal ToString in preparation for Decimal256
> ---
>
> Key: ARROW-9710
> URL: https://issues.apache.org/jira/browse/ARROW-9710
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Mingyu Zhong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Generalize Decimal ToString method in preparation for introducing Decimal256 
> bit type (and other bit widths as needed).  
>  
>  





[jira] [Commented] (ARROW-9676) [R] Option to import structs as lists

2020-08-12 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176581#comment-17176581
 ] 

Neal Richardson commented on ARROW-9676:


Hmm. Your example (with all NAs) doesn't error for me. From the error message, 
it sounds like there's a bug somewhere in the recursive logic that handles the 
struct arrays, but we're going to need something reproducible in order to 
identify and ensure that we fix it. 

Aside: I don't think your original suggestion of importing struct as R list 
instead of data.frame would help: a data.frame is just a list with extra 
attributes. The recursive logic would still be needed to read the struct array, 
so we need to find that bug and fix it.

What is the origin of the Parquet file you're reading? Since you're hitting an 
error about {{SET_STRING_ELT()}} I wonder if you have non-UTF-8 text, or 
perhaps embedded nuls.
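
A quick way to test that hypothesis is to scan the raw string values for bytes that are not valid UTF-8 or that contain a NUL byte. The following is a rough sketch in Python using only the standard library; the function name is hypothetical, and it assumes the column's values have already been obtained as raw byte strings (how to extract them from the Parquet file is outside the scope of the sketch).

```python
from typing import List, Optional, Sequence, Tuple


def find_problem_strings(values: Sequence[Optional[bytes]]) -> List[Tuple[int, str]]:
    """Return (index, reason) for values that are not clean UTF-8 text."""
    problems: List[Tuple[int, str]] = []
    for index, raw in enumerate(values):
        if raw is None:
            # Missing values are not suspicious by themselves.
            continue
        if b"\x00" in raw:
            problems.append((index, "embedded nul"))
            continue
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            problems.append((index, "invalid UTF-8"))
    return problems
```

An empty result would suggest the text encoding is not the cause and that the bug lies elsewhere.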

> [R] Option to import structs as lists
> -
>
> Key: ARROW-9676
> URL: https://issues.apache.org/jira/browse/ARROW-9676
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Affects Versions: 1.0.0
> Environment: Amazon Linux, 32gb of ram
>Reporter: Nick DiQuattro
>Priority: Major
>
> When trying to collect data from a dataset based on parquet files with nested 
> structs (column is a struct with 2 structs nested) of moderate size (1Mish 
> rows), R crashes. If I add a filter to reduce the number of rows, the data is 
> parsed. If I select out the struct column, it works great (up to 21M rows). 
> My hunch is the structs resulting in data.frame columns may be the issue. I 
> am curious if there's a way to have arrow import structs as lists instead of 
> data.frames. Thanks for the direction to here [~neilr8133]!





[jira] [Updated] (ARROW-9676) [R] Error converting Table with nested structs

2020-08-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9676:
---
Summary: [R] Error converting Table with nested structs  (was: [R] Option 
to import structs as lists)

> [R] Error converting Table with nested structs
> --
>
> Key: ARROW-9676
> URL: https://issues.apache.org/jira/browse/ARROW-9676
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Affects Versions: 1.0.0
> Environment: Amazon Linux, 32gb of ram
>Reporter: Nick DiQuattro
>Priority: Major
>
> When trying to collect data from a dataset based on parquet files with nested 
> structs (column is a struct with 2 structs nested) of moderate size (1Mish 
> rows), R crashes. If I add a filter to reduce the number of rows, the data is 
> parsed. If I select out the struct column, it works great (up to 21M rows). 
> My hunch is the structs resulting in data.frame columns may be the issue. I 
> am curious if there's a way to have arrow import structs as lists instead of 
> data.frames. Thanks for the direction to here [~neilr8133]!





[jira] [Assigned] (ARROW-9694) [Ruby] can't install red-arrow-gsl

2020-08-12 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-9694:
---

Assignee: Kouhei Sutou

> [Ruby] can't install red-arrow-gsl
> --
>
> Key: ARROW-9694
> URL: https://issues.apache.org/jira/browse/ARROW-9694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 1.0.0
> Environment: windows, msys2, 
>Reporter: Dominic Sisneros
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: gem_make.out
>
>
> {noformat}
> f:\programming\source\repos\ruby\try_arrow>gem install red-arrow-gsl
> Temporarily enhancing PATH for MSYS/MINGW...
> Building native extensions. This could take a while...
> ERROR:  Error installing red-arrow-gsl:
> ERROR: Failed to build gem native extension.
> current directory: 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3/ext/gsl_native
> F:/windows/scoop/apps/ruby/2.7.1-1/bin/ruby.exe -I 
> F:/windows/scoop/apps/ruby/2.7.1-1/lib/ruby/site_ruby/2.7.0 -r 
> ./siteconf20200811-28480-149f31i.rb extconf.rb
> sh: gsl-config: No such file or directory
> *** ERROR: missing required library to compile this module: undefined method 
> `chomp' for nil:NilClass
> *** extconf.rb failed ***
> Could not create Makefile due to some reason, probably lack of necessary
> libraries and/or headers.  Check the mkmf.log file for more details.  You may
> need configuration options.
> Provided configuration options:
> --with-opt-dir
> --without-opt-dir
> --with-opt-include
> --without-opt-include=${opt-dir}/include
> --with-opt-lib
> --without-opt-lib=${opt-dir}/lib
> --with-make-prog
> --without-make-prog
> --srcdir=.
> --curdir
> --ruby=F:/windows/scoop/apps/ruby/2.7.1-1/bin/$(RUBY_BASE_NAME)
> --with-gsl-version
> extconf failed, exit code 1
> Gem files will remain installed in 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3 for inspection.
> Results logged to 
> F:/windows/scoop/persist/ruby/gems/extensions/x64-mingw32/2.7.0/gsl-2.1.0.3/gem_make.out
> {noformat}





[jira] [Created] (ARROW-9711) [Rust] Add benchmark based on TPC-H

2020-08-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-9711:
-

 Summary: [Rust] Add benchmark based on TPC-H
 Key: ARROW-9711
 URL: https://issues.apache.org/jira/browse/ARROW-9711
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


We need better benchmarks for testing at scale, so TPC benchmarks are ideal 
since data can be generated at different scale factors. TPC-H seems like a good 
fit for Arrow so I would like to contribute a benchmark based on that.





[jira] [Updated] (ARROW-9711) [Rust] Add benchmark based on TPC-H

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9711:
--
Labels: pull-request-available  (was: )

> [Rust] Add benchmark based on TPC-H
> ---
>
> Key: ARROW-9711
> URL: https://issues.apache.org/jira/browse/ARROW-9711
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We need better benchmarks for testing at scale, so TPC benchmarks are ideal 
> since data can be generated at different scale factors. TPC-H seems like a 
> good fit for Arrow so I would like to contribute a benchmark based on that.





[jira] [Assigned] (ARROW-9711) [Rust] Add benchmark based on TPC-H

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9711:


Assignee: Andy Grove  (was: Apache Arrow JIRA Bot)

> [Rust] Add benchmark based on TPC-H
> ---
>
> Key: ARROW-9711
> URL: https://issues.apache.org/jira/browse/ARROW-9711
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We need better benchmarks for testing at scale, so TPC benchmarks are ideal 
> since data can be generated at different scale factors. TPC-H seems like a 
> good fit for Arrow so I would like to contribute a benchmark based on that.





[jira] [Assigned] (ARROW-9711) [Rust] Add benchmark based on TPC-H

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9711:


Assignee: Apache Arrow JIRA Bot  (was: Andy Grove)

> [Rust] Add benchmark based on TPC-H
> ---
>
> Key: ARROW-9711
> URL: https://issues.apache.org/jira/browse/ARROW-9711
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We need better benchmarks for testing at scale, so TPC benchmarks are ideal 
> since data can be generated at different scale factors. TPC-H seems like a 
> good fit for Arrow so I would like to contribute a benchmark based on that.





[jira] [Created] (ARROW-9712) [Rust] [DataFusion] ParquetScanExec panics on error

2020-08-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-9712:
-

 Summary: [Rust] [DataFusion] ParquetScanExec panics on error
 Key: ARROW-9712
 URL: https://issues.apache.org/jira/browse/ARROW-9712
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


ParquetScanExec panics on error





[jira] [Updated] (ARROW-9712) [Rust] [DataFusion] ParquetScanExec panics on error

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9712:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] ParquetScanExec panics on error
> ---
>
> Key: ARROW-9712
> URL: https://issues.apache.org/jira/browse/ARROW-9712
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ParquetScanExec panics on error





[jira] [Assigned] (ARROW-9712) [Rust] [DataFusion] ParquetScanExec panics on error

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9712:


Assignee: Andy Grove  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] ParquetScanExec panics on error
> ---
>
> Key: ARROW-9712
> URL: https://issues.apache.org/jira/browse/ARROW-9712
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ParquetScanExec panics on error





[jira] [Assigned] (ARROW-9712) [Rust] [DataFusion] ParquetScanExec panics on error

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9712:


Assignee: Apache Arrow JIRA Bot  (was: Andy Grove)

> [Rust] [DataFusion] ParquetScanExec panics on error
> ---
>
> Key: ARROW-9712
> URL: https://issues.apache.org/jira/browse/ARROW-9712
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ParquetScanExec panics on error





[jira] [Commented] (ARROW-9707) [Rust] [DataFusion] Re-implement threading model

2020-08-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176635#comment-17176635
 ] 

Andy Grove commented on ARROW-9707:
---

Here's the Google doc: 
https://docs.google.com/document/d/1NUiIKxgdiKrEv1H4JXVmk_nxq-eG9et7LKPD91Du-Io/edit?usp=sharing

> [Rust] [DataFusion] Re-implement threading model
> 
>
> Key: ARROW-9707
> URL: https://issues.apache.org/jira/browse/ARROW-9707
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> The current threading model is very simple and does not scale. We currently 
> use 1-2 dedicated threads per partition and they all run simultaneously, 
> which is a huge problem if you have more partitions than logical or physical 
> cores.
> This task is to re-implement the threading model so that query execution uses 
> a fixed (configurable) number of threads. Work will be broken down into 
> stages and tasks and each in-process executor (running on a dedicated thread) 
> will process its queue of tasks.
> This process will be driven by a scheduler.
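
The proposed model (a fixed number of executors, each on a dedicated thread with its own task queue, fed by a scheduler) can be sketched roughly as below. This is a conceptual illustration in Python, not the design from the linked document; the function name, the round-robin scheduling policy, and the use of None as a shutdown signal are all assumptions made for the example.

```python
import queue
import threading
from typing import Any, Callable, List


def run_with_fixed_executors(
    tasks: List[Callable[[], Any]], num_executors: int
) -> List[Any]:
    """Run tasks on a fixed number of executor threads (hypothetical helper)."""
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")

    queues: List[queue.Queue] = [queue.Queue() for _ in range(num_executors)]
    results: List[Any] = [None] * len(tasks)

    def executor(task_queue: queue.Queue) -> None:
        # Each executor drains its own queue until it sees the sentinel.
        while True:
            item = task_queue.get()
            if item is None:
                return
            index, task = item
            results[index] = task()

    threads = [threading.Thread(target=executor, args=(q,)) for q in queues]
    for thread in threads:
        thread.start()

    # The "scheduler": assign tasks to executors round-robin.
    for index, task in enumerate(tasks):
        queues[index % num_executors].put((index, task))
    for q in queues:
        q.put(None)  # tell each executor to stop once its queue is empty
    for thread in threads:
        thread.join()
    return results
```

The thread count here depends only on num_executors, not on the number of tasks or partitions, which is the property the description asks for. Error handling is omitted: a task that raises would end its executor thread early.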





[jira] [Created] (ARROW-9713) [Rust][DataFusion] Remove explicit panics

2020-08-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-9713:
-

 Summary: [Rust][DataFusion] Remove explicit panics
 Key: ARROW-9713
 URL: https://issues.apache.org/jira/browse/ARROW-9713
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


There are two explicit panics in the datafusion codebase. We should remove them.





[jira] [Updated] (ARROW-9713) [Rust][DataFusion] Remove explicit panics

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9713:
--
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Remove explicit panics
> -
>
> Key: ARROW-9713
> URL: https://issues.apache.org/jira/browse/ARROW-9713
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are two explicit panics in the datafusion codebase. We should remove 
> them.





[jira] [Assigned] (ARROW-9713) [Rust][DataFusion] Remove explicit panics

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9713:


Assignee: Apache Arrow JIRA Bot  (was: Andy Grove)

> [Rust][DataFusion] Remove explicit panics
> -
>
> Key: ARROW-9713
> URL: https://issues.apache.org/jira/browse/ARROW-9713
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are two explicit panics in the datafusion codebase. We should remove 
> them.





[jira] [Assigned] (ARROW-9713) [Rust][DataFusion] Remove explicit panics

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9713:


Assignee: Andy Grove  (was: Apache Arrow JIRA Bot)

> [Rust][DataFusion] Remove explicit panics
> -
>
> Key: ARROW-9713
> URL: https://issues.apache.org/jira/browse/ARROW-9713
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are two explicit panics in the datafusion codebase. We should remove 
> them.





[jira] [Updated] (ARROW-9667) [CI][Crossbow] Segfault in 2 nightly R builds

2020-08-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9667:
---
Fix Version/s: (was: 1.0.1)

> [CI][Crossbow] Segfault in 2 nightly R builds
> -
>
> Key: ARROW-9667
> URL: https://issues.apache.org/jira/browse/ARROW-9667
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Assignee: Yibo Cai
>Priority: Major
> Fix For: 2.0.0
>
>
> {code}
>  *** caught illegal operation ***
>   address 0x7f9a07216687, cause 'illegal operand'
> {code}
> when calling compute__CallFunction("is_null") on an Int32Array. This happens 
> to be in the first test of the test suite, so the specific action is probably 
> not relevant: 
> https://github.com/apache/arrow/blob/master/r/tests/testthat/test-Array.R#L49
> This is happening on 
> test-r-rstudio-r-base-3.6-bionic
> test-r-rstudio-r-base-3.6-opensuse15
> but not on 
> test-r-linux-as-cran
> test-r-rhub-ubuntu-gcc-release
> test-r-rocker-r-base-latest
> test-r-rstudio-r-base-3.6-centos6
> test-r-rstudio-r-base-3.6-centos8
> test-r-rstudio-r-base-3.6-opensuse42
> or the builds we do on every commit (centos7 and a different ubuntu bionic).
> bionic started failing on July 31, and opensuse15 started failing on August 
> 1, so it's possible that there was a change to those containers upstream that 
> caused the issue and not a code change of ours.





[jira] [Created] (ARROW-9714) [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort

2020-08-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-9714:
-

 Summary: [Rust] [DataFusion] TypeCoercionRule not implemented for 
Limit or Sort
 Key: ARROW-9714
 URL: https://issues.apache.org/jira/browse/ARROW-9714
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


TypeCoercionRule not implemented for Limit or Sort, causing TPC-H query 1 to 
fail.





[jira] [Assigned] (ARROW-9714) [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9714:


Assignee: Apache Arrow JIRA Bot  (was: Andy Grove)

> [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort
> --
>
> Key: ARROW-9714
> URL: https://issues.apache.org/jira/browse/ARROW-9714
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TypeCoercionRule not implemented for Limit or Sort, causing TPC-H query 1 to 
> fail.





[jira] [Updated] (ARROW-9714) [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9714:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort
> --
>
> Key: ARROW-9714
> URL: https://issues.apache.org/jira/browse/ARROW-9714
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TypeCoercionRule not implemented for Limit or Sort, causing TPC-H query 1 to 
> fail.





[jira] [Assigned] (ARROW-9714) [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9714:


Assignee: Andy Grove  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort
> --
>
> Key: ARROW-9714
> URL: https://issues.apache.org/jira/browse/ARROW-9714
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TypeCoercionRule not implemented for Limit or Sort, causing TPC-H query 1 to 
> fail.





[jira] [Created] (ARROW-9715) [R] changelog/doc updates for 1.0.1

2020-08-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9715:
--

 Summary: [R] changelog/doc updates for 1.0.1
 Key: ARROW-9715
 URL: https://issues.apache.org/jira/browse/ARROW-9715
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.1, 2.0.0








[jira] [Updated] (ARROW-9715) [R] changelog/doc updates for 1.0.1

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9715:
--
Labels: pull-request-available  (was: )

> [R] changelog/doc updates for 1.0.1
> ---
>
> Key: ARROW-9715
> URL: https://issues.apache.org/jira/browse/ARROW-9715
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-9715) [R] changelog/doc updates for 1.0.1

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9715:


Assignee: Apache Arrow JIRA Bot  (was: Neal Richardson)

> [R] changelog/doc updates for 1.0.1
> ---
>
> Key: ARROW-9715
> URL: https://issues.apache.org/jira/browse/ARROW-9715
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-9715) [R] changelog/doc updates for 1.0.1

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9715:


Assignee: Neal Richardson  (was: Apache Arrow JIRA Bot)

> [R] changelog/doc updates for 1.0.1
> ---
>
> Key: ARROW-9715
> URL: https://issues.apache.org/jira/browse/ARROW-9715
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9712) [Rust] [DataFusion] ParquetScanExec panics on error

2020-08-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9712:
--
Fix Version/s: 1.0.1

> [Rust] [DataFusion] ParquetScanExec panics on error
> ---
>
> Key: ARROW-9712
> URL: https://issues.apache.org/jira/browse/ARROW-9712
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ParquetScanExec panics on error





[jira] [Updated] (ARROW-9714) [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort

2020-08-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9714:
--
Fix Version/s: 1.0.1

> [Rust] [DataFusion] TypeCoercionRule not implemented for Limit or Sort
> --
>
> Key: ARROW-9714
> URL: https://issues.apache.org/jira/browse/ARROW-9714
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> TypeCoercionRule not implemented for Limit or Sort, causing TPC-H query 1 to 
> fail.





[jira] [Resolved] (ARROW-9715) [R] changelog/doc updates for 1.0.1

2020-08-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9715.

Resolution: Fixed

Issue resolved by pull request 7950
[https://github.com/apache/arrow/pull/7950]

> [R] changelog/doc updates for 1.0.1
> ---
>
> Key: ARROW-9715
> URL: https://issues.apache.org/jira/browse/ARROW-9715
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0, 1.0.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-9716) [Rust] [DataFusion] MergeExec should have concurrency limit

2020-08-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-9716:
-

 Summary: [Rust] [DataFusion] MergeExec should have concurrency limit
 Key: ARROW-9716
 URL: https://issues.apache.org/jira/browse/ARROW-9716
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 1.0.1, 2.0.0


MergeExec currently spins up one thread per input partition which causes apps 
to effectively hang if there are substantially more partitions than available 
cores.

We can implement a configurable limit here pretty easily.
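
As a rough illustration of what such a limit looks like, the sketch below reads partitions through a thread pool whose size is capped by a configurable value, instead of starting one thread per partition. It is written in Python rather than Rust and is not the MergeExec implementation; merge_partitions, read_partition, and max_concurrency are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Sequence, TypeVar

P = TypeVar("P")
B = TypeVar("B")


def merge_partitions(
    partitions: Sequence[P],
    read_partition: Callable[[P], Iterable[B]],
    max_concurrency: int,
) -> List[B]:
    """Read all partitions with at most max_concurrency threads, then merge."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")

    # The pool size, not the partition count, bounds the number of threads.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        per_partition = list(pool.map(read_partition, partitions))

    merged: List[B] = []
    for batches in per_partition:
        merged.extend(batches)
    return merged
```

With 1,000 partitions and max_concurrency set to the number of cores, the remaining partitions wait in the pool's queue instead of competing for CPU time.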





[jira] [Updated] (ARROW-9716) [Rust] [DataFusion] MergeExec should have concurrency limit

2020-08-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9716:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] MergeExec should have concurrency limit
> 
>
> Key: ARROW-9716
> URL: https://issues.apache.org/jira/browse/ARROW-9716
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> MergeExec currently spins up one thread per input partition which causes apps 
> to effectively hang if there are substantially more partitions than available 
> cores.
> We can implement a configurable limit here pretty easily.





[jira] [Assigned] (ARROW-9716) [Rust] [DataFusion] MergeExec should have concurrency limit

2020-08-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9716:


Assignee: Andy Grove  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] MergeExec should have concurrency limit
> 
>
> Key: ARROW-9716
> URL: https://issues.apache.org/jira/browse/ARROW-9716
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> MergeExec currently spins up one thread per input partition which causes apps 
> to effectively hang if there are substantially more partitions than available 
> cores.
> We can implement a configurable limit here pretty easily.




