[jira] [Created] (ARROW-8580) Pyarrow exceptions are not helpful
Soroush Radpour created ARROW-8580: -- Summary: Pyarrow exceptions are not helpful Key: ARROW-8580 URL: https://issues.apache.org/jira/browse/ARROW-8580 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Soroush Radpour I'm trying to understand an exception raised by code that uses pyarrow, and the message is not very helpful. {{ File "pyarrow/_parquet.pyx", line 1036, in pyarrow._parquet.ParquetReader.open File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status OSError: IOError: b'Service Unavailable'. Detail: Python exception: RuntimeError}} It would be great if each of the three exceptions were unwrapped, with the full stack trace and error message that came with it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
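The unwrapping the reporter asks for is essentially Python's exception chaining. A minimal sketch of the requested behavior (not pyarrow's actual internals; the function names here are invented for illustration):

```python
# Sketch: wrap a low-level error while keeping the original exception
# and its traceback attached via `raise ... from`.

def low_level_read():
    # Stand-in for whatever raised the underlying RuntimeError.
    raise RuntimeError("Service Unavailable")

def open_parquet():
    try:
        low_level_read()
    except RuntimeError as exc:
        # Chaining stores the inner exception as __cause__, so both
        # tracebacks are printed, instead of a single flattened message
        # like "OSError: IOError: b'Service Unavailable'".
        raise OSError("failed to open Parquet file") from exc

try:
    open_parquet()
except OSError as err:
    print(type(err.__cause__).__name__)  # the wrapped RuntimeError survives
```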
[jira] [Resolved] (ARROW-8473) [Rust] "Statistics support" in rust/parquet readme is incorrect
[ https://issues.apache.org/jira/browse/ARROW-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved ARROW-8473. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6951 [https://github.com/apache/arrow/pull/6951] > [Rust] "Statistics support" in rust/parquet readme is incorrect > --- > > Key: ARROW-8473 > URL: https://issues.apache.org/jira/browse/ARROW-8473 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Krzysztof Stanisławek >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Statistics are not actually supported in the Rust implementation of Parquet. See > [https://github.com/apache/arrow/blob/3e3712a14a3242d70145fb9d3d6f0f4b8c374e68/rust/parquet/src/column/writer.rs#L522] > or similar lines in this file, or writer.rs. > https://github.com/apache/arrow/pull/6951
[jira] [Created] (ARROW-8579) [C++] AVX512 part for SIMD operations of DecodeSpaced/EncodeSpaced
Frank Du created ARROW-8579: --- Summary: [C++] AVX512 part for SIMD operations of DecodeSpaced/EncodeSpaced Key: ARROW-8579 URL: https://issues.apache.org/jira/browse/ARROW-8579 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Frank Du Assignee: Frank Du As part of https://issues.apache.org/jira/browse/PARQUET-1841, an AVX512 path was identified with the help of the mask_compress_/mask_expand_ APIs. This Jira covers the spaced benchmark, unit tests, the AVX512 path, and other groundwork for further potential SIMD opportunities with SSE/AVX2.
[jira] [Assigned] (ARROW-8577) [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
[ https://issues.apache.org/jira/browse/ARROW-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-8577: --- Assignee: Kouhei Sutou > [GLib][Plasma] gplasma_client_options_new() default settings are enabling a > check for CUDA device > - > > Key: ARROW-8577 > URL: https://issues.apache.org/jira/browse/ARROW-8577 > Project: Apache Arrow > Issue Type: Bug > Components: GLib >Reporter: Tanveer >Assignee: Kouhei Sutou >Priority: Major > > Hi all, > Previously, I was using the c_glib Plasma library (build 0.12) for creating > plasma objects. It was working as expected. But now I want to use Arrow's > newest build. I encountered the following error: > > /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on > an error: IOError: Cuda error 100 in function 'cuInit': > [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected > I think the plasma client options (gplasma_client_options_new()) which I am using > with default settings are enabling a check for my CUDA device, and I have no > CUDA device attached to my system. How can I disable this check? Any help > will be highly appreciated. Thanks
[jira] [Commented] (ARROW-8577) [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
[ https://issues.apache.org/jira/browse/ARROW-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091124#comment-17091124 ] Kouhei Sutou commented on ARROW-8577: - Could you show a program that reproduces this problem? > [GLib][Plasma] gplasma_client_options_new() default settings are enabling a > check for CUDA device > - > > Key: ARROW-8577 > URL: https://issues.apache.org/jira/browse/ARROW-8577 > Project: Apache Arrow > Issue Type: Bug > Components: GLib >Reporter: Tanveer >Priority: Major > > Hi all, > Previously, I was using the c_glib Plasma library (build 0.12) for creating > plasma objects. It was working as expected. But now I want to use Arrow's > newest build. I encountered the following error: > > /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on > an error: IOError: Cuda error 100 in function 'cuInit': > [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected > I think the plasma client options (gplasma_client_options_new()) which I am using > with default settings are enabling a check for my CUDA device, and I have no > CUDA device attached to my system. How can I disable this check? Any help > will be highly appreciated. Thanks
[jira] [Commented] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"
[ https://issues.apache.org/jira/browse/ARROW-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091125#comment-17091125 ] Wes McKinney commented on ARROW-8578: - Hm, I rebooted my laptop and it works now, so the warning above may be a red herring. It's curious that something would go wrong with my networking. > [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on > compiling system" > > > Key: ARROW-8578 > URL: https://issues.apache.org/jira/browse/ARROW-8578 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Tried compiling and running this today (with grpc 1.28.1) > {code} > $ release/arrow-flight-benchmark > Using standalone server: false > Server running with pid 22385 > Testing method: DoGet > Server host: localhost > Server port: 31337 > E0423 21:54:15.174285695 22385 socket_utils_common_posix.cc:222] check for > SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT > unavailable on compiling > system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190} > Server host: localhost > {code} > my Linux kernel > {code} > $ uname -a > Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 > x86_64 x86_64 GNU/Linux > {code}
[jira] [Updated] (ARROW-8577) [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
[ https://issues.apache.org/jira/browse/ARROW-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8577: Summary: [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device (was: [CGlib Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device) > [GLib][Plasma] gplasma_client_options_new() default settings are enabling a > check for CUDA device > - > > Key: ARROW-8577 > URL: https://issues.apache.org/jira/browse/ARROW-8577 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > > Hi all, > Previously, I was using the c_glib Plasma library (build 0.12) for creating > plasma objects. It was working as expected. But now I want to use Arrow's > newest build. I encountered the following error: > > /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on > an error: IOError: Cuda error 100 in function 'cuInit': > [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected > I think the plasma client options (gplasma_client_options_new()) which I am using > with default settings are enabling a check for my CUDA device, and I have no > CUDA device attached to my system. How can I disable this check? Any help > will be highly appreciated. Thanks
[jira] [Updated] (ARROW-8577) [GLib][Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
[ https://issues.apache.org/jira/browse/ARROW-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8577: Component/s: GLib > [GLib][Plasma] gplasma_client_options_new() default settings are enabling a > check for CUDA device > - > > Key: ARROW-8577 > URL: https://issues.apache.org/jira/browse/ARROW-8577 > Project: Apache Arrow > Issue Type: Bug > Components: GLib >Reporter: Tanveer >Priority: Major > > Hi all, > Previously, I was using the c_glib Plasma library (build 0.12) for creating > plasma objects. It was working as expected. But now I want to use Arrow's > newest build. I encountered the following error: > > /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on > an error: IOError: Cuda error 100 in function 'cuInit': > [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected > I think the plasma client options (gplasma_client_options_new()) which I am using > with default settings are enabling a check for my CUDA device, and I have no > CUDA device attached to my system. How can I disable this check? Any help > will be highly appreciated. Thanks
[jira] [Commented] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"
[ https://issues.apache.org/jira/browse/ARROW-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091119#comment-17091119 ] Wes McKinney commented on ARROW-8578: - The executable appears to just hang, which is also a bad failure mode. > [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on > compiling system" > > > Key: ARROW-8578 > URL: https://issues.apache.org/jira/browse/ARROW-8578 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Tried compiling and running this today (with grpc 1.28.1) > {code} > $ release/arrow-flight-benchmark > Using standalone server: false > Server running with pid 22385 > Testing method: DoGet > Server host: localhost > Server port: 31337 > E0423 21:54:15.174285695 22385 socket_utils_common_posix.cc:222] check for > SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT > unavailable on compiling > system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190} > Server host: localhost > {code} > my Linux kernel > {code} > $ uname -a > Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 > x86_64 x86_64 GNU/Linux > {code}
[jira] [Commented] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"
[ https://issues.apache.org/jira/browse/ARROW-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091115#comment-17091115 ] Wes McKinney commented on ARROW-8578: - [~lidavidm] do you know what this is about? > [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on > compiling system" > > > Key: ARROW-8578 > URL: https://issues.apache.org/jira/browse/ARROW-8578 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Tried compiling and running this today (with grpc 1.28.1) > {code} > $ release/arrow-flight-benchmark > Using standalone server: false > Server running with pid 22385 > Testing method: DoGet > Server host: localhost > Server port: 31337 > E0423 21:54:15.174285695 22385 socket_utils_common_posix.cc:222] check for > SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT > unavailable on compiling > system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190} > Server host: localhost > {code} > my Linux kernel > {code} > $ uname -a > Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 > x86_64 x86_64 GNU/Linux > {code}
[jira] [Created] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"
Wes McKinney created ARROW-8578: --- Summary: [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system" Key: ARROW-8578 URL: https://issues.apache.org/jira/browse/ARROW-8578 Project: Apache Arrow Issue Type: Bug Components: C++, FlightRPC Reporter: Wes McKinney Fix For: 1.0.0 Tried compiling and running this today (with grpc 1.28.1) {code} $ release/arrow-flight-benchmark Using standalone server: false Server running with pid 22385 Testing method: DoGet Server host: localhost Server port: 31337 E0423 21:54:15.174285695 22385 socket_utils_common_posix.cc:222] check for SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT unavailable on compiling system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190} Server host: localhost {code} my Linux kernel {code} $ uname -a Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux {code}
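The gRPC message distinguishes the *compiling* system from the running one: SO_REUSEPORT support is detected at build time, so a binary built without it warns even on a kernel that supports the option. A quick runtime probe for whether the running system accepts SO_REUSEPORT (a standalone sketch, unrelated to Arrow's own code):

```python
import socket

def so_reuseport_available() -> bool:
    """Return True if the running kernel accepts SO_REUSEPORT."""
    if not hasattr(socket, "SO_REUSEPORT"):
        # Constant absent: the Python build itself lacks SO_REUSEPORT.
        return False
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        return True
    except OSError:
        # Constant exists but the kernel rejects it.
        return False
    finally:
        sock.close()

print(so_reuseport_available())
```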
[jira] [Created] (ARROW-8577) [CGlib Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device
Tanveer created ARROW-8577: -- Summary: [CGlib Plasma] gplasma_client_options_new() default settings are enabling a check for CUDA device Key: ARROW-8577 URL: https://issues.apache.org/jira/browse/ARROW-8577 Project: Apache Arrow Issue Type: Bug Reporter: Tanveer Hi all, Previously, I was using the c_glib Plasma library (build 0.12) for creating plasma objects. It was working as expected. But now I want to use Arrow's newest build. I encountered the following error: /build/apache-arrow-0.17.0/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: IOError: Cuda error 100 in function 'cuInit': [CUDA_ERROR_NO_DEVICE] no CUDA-capable device is detected I think the plasma client options (gplasma_client_options_new()) which I am using with default settings are enabling a check for my CUDA device, and I have no CUDA device attached to my system. How can I disable this check? Any help will be highly appreciated. Thanks
[jira] [Created] (ARROW-8576) [Rust] Implement ArrayEqual for UnionArray
Paddy Horan created ARROW-8576: -- Summary: [Rust] Implement ArrayEqual for UnionArray Key: ARROW-8576 URL: https://issues.apache.org/jira/browse/ARROW-8576 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Paddy Horan Assignee: Paddy Horan
[jira] [Resolved] (ARROW-8516) [Rust] Slow BufferBuilder inserts within PrimitiveBuilder::append_slice
[ https://issues.apache.org/jira/browse/ARROW-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan resolved ARROW-8516. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6980 [https://github.com/apache/arrow/pull/6980] > [Rust] Slow BufferBuilder inserts within > PrimitiveBuilder::append_slice > > > Key: ARROW-8516 > URL: https://issues.apache.org/jira/browse/ARROW-8516 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Trivial > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > BufferBuilder<BooleanType>::append_slice is called by PrimitiveBuilder::append_slice with a > constructed vector of true values. > Even in release builds the associated allocations and > iterations are not optimised out, resulting in a third of the time to parse a > parquet file containing single integers being spent in > PrimitiveBuilder::append_slice. > This PR adds an append_n method to the BufferBuilderTrait that > allows this to be handled more efficiently. My rather unscientific testing > shows it to halve the amount of time spent in this method, yielding an ~20% > speedup for my particular workload.
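The inefficiency described above — materializing a temporary vector of identical values only to copy it element by element — versus the proposed `append_n`-style bulk write can be sketched outside Rust. A hypothetical Python analogy (the function names mirror the Rust methods but are invented for illustration):

```python
# Element-wise path (analogous to append_slice with a temporary vector):
# every value is pushed individually.
def append_slice(buf: bytearray, values) -> None:
    for v in values:
        buf.append(v)

# Bulk path (analogous to the proposed append_n): write n copies of one
# value in a single operation, skipping the temporary vector and the
# per-element loop.
def append_n(buf: bytearray, n: int, value: int) -> None:
    buf.extend(bytes([value]) * n)

a, b = bytearray(), bytearray()
append_slice(a, [1] * 1000)   # allocates the temporary [1, 1, ..., 1]
append_n(b, 1000, 1)          # no temporary, one bulk extend
assert a == b                 # same result, fewer allocations
```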
[jira] [Resolved] (ARROW-8552) [Rust] support column iteration for parquet row
[ https://issues.apache.org/jira/browse/ARROW-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan resolved ARROW-8552. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7009 [https://github.com/apache/arrow/pull/7009] > [Rust] support column iteration for parquet row > --- > > Key: ARROW-8552 > URL: https://issues.apache.org/jira/browse/ARROW-8552 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: QP Hou >Assignee: QP Hou >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > It would be useful to be able to iterate through all the columns in a parquet > row.
[jira] [Assigned] (ARROW-8552) [Rust] support column iteration for parquet row
[ https://issues.apache.org/jira/browse/ARROW-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan reassigned ARROW-8552: -- Assignee: QP Hou > [Rust] support column iteration for parquet row > --- > > Key: ARROW-8552 > URL: https://issues.apache.org/jira/browse/ARROW-8552 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: QP Hou >Assignee: QP Hou >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > It would be useful to be able to iterate through all the columns in a parquet > row.
[jira] [Resolved] (ARROW-8541) [Release] Don't remove previous source releases automatically
[ https://issues.apache.org/jira/browse/ARROW-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8541. - Resolution: Fixed Issue resolved by pull request 6998 [https://github.com/apache/arrow/pull/6998] > [Release] Don't remove previous source releases automatically > - > > Key: ARROW-8541 > URL: https://issues.apache.org/jira/browse/ARROW-8541 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > We should keep at least the last three source tarballs.
[jira] [Updated] (ARROW-8575) [Developer] Add issue_comment workflow to rebase a PR
[ https://issues.apache.org/jira/browse/ARROW-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8575: -- Labels: pull-request-available (was: ) > [Developer] Add issue_comment workflow to rebase a PR > - > > Key: ARROW-8575 > URL: https://issues.apache.org/jira/browse/ARROW-8575 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Created] (ARROW-8575) [Developer] Add issue_comment workflow to rebase a PR
Neal Richardson created ARROW-8575: -- Summary: [Developer] Add issue_comment workflow to rebase a PR Key: ARROW-8575 URL: https://issues.apache.org/jira/browse/ARROW-8575 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Neal Richardson Assignee: Neal Richardson
[jira] [Resolved] (ARROW-8564) [Website] Add Ubuntu 20.04 LTS to supported package list
[ https://issues.apache.org/jira/browse/ARROW-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8564. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 56 [https://github.com/apache/arrow-site/pull/56] > [Website] Add Ubuntu 20.04 LTS to supported package list > > > Key: ARROW-8564 > URL: https://issues.apache.org/jira/browse/ARROW-8564 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h >
[jira] [Resolved] (ARROW-7950) [Python] When initializing pandas API shim, inform user if their installed pandas version is too old
[ https://issues.apache.org/jira/browse/ARROW-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-7950. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6992 [https://github.com/apache/arrow/pull/6992] > [Python] When initializing pandas API shim, inform user if their installed > pandas version is too old > > > Key: ARROW-7950 > URL: https://issues.apache.org/jira/browse/ARROW-7950 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h >
[jira] [Updated] (ARROW-8572) [Python] Expose UnionArray.array and other fields
[ https://issues.apache.org/jira/browse/ARROW-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8572: -- Labels: pull-request-available (was: ) > [Python] Expose UnionArray.array and other fields > - > > Key: ARROW-8572 > URL: https://issues.apache.org/jira/browse/ARROW-8572 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently in Python, you can construct a UnionArray easily, but getting the > data back out (without copying) is near-impossible. We should expose the > getter for UnionArray.array so we can pull out the constituent arrays. We > should also expose fields like mode while we're at it. > The use case is: in Flight, we'd like to write multiple distinct datasets > (with distinct schemas) in a single logical call; using UnionArrays lets us > combine these datasets into a single logical dataset.
[jira] [Updated] (ARROW-7391) [Python] Remove unnecessary classes from the binding layer
[ https://issues.apache.org/jira/browse/ARROW-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7391: -- Labels: dataset pull-request-available (was: dataset) > [Python] Remove unnecessary classes from the binding layer > -- > > Key: ARROW-7391 > URL: https://issues.apache.org/jira/browse/ARROW-7391 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Several Python classes introduced by > https://github.com/apache/arrow/pull/5237 are unnecessary and can be removed > in favor of simple functions which produce opaque pointers, including the > PartitionScheme and Expression classes. These should be removed to reduce > cognitive overhead of the Python datasets API and to loosen coupling between > Python and C++.
[jira] [Commented] (ARROW-8566) [R] error when writing POSIXct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090916#comment-17090916 ] Neal Richardson commented on ARROW-8566: Great, thanks for debugging with me. I created https://github.com/sparklyr/sparklyr/issues/2439 because I think the current {{arrow}} behavior is correct (certainly the 0.16 behavior was not correct, unless you happen to live in UTC) so this might need to be worked around in {{sparklyr}}. > [R] error when writing POSIXct to spark > --- > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Major > > ``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted.
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at >
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:14) > #> at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > #> at > io.netty.channel.AbstractChannelHandl
[jira] [Updated] (ARROW-2260) [C++][Plasma] plasma_store should show usage
[ https://issues.apache.org/jira/browse/ARROW-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2260: -- Labels: pull-request-available (was: ) > [C++][Plasma] plasma_store should show usage > > > Key: ARROW-2260 > URL: https://issues.apache.org/jira/browse/ARROW-2260 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Currently the options exposed by the {{plasma_store}} executable aren't very > discoverable: > {code:bash} > $ plasma_store -h > please specify socket for incoming connections with -s switch > Abandon > (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting > *)$ plasma_store > please specify socket for incoming connections with -s switch > Abandon > (pyarrow) antoine@fsol:~/arrow/cpp (ARROW-2135-nan-conversion-when-casting > *)$ plasma_store --help > plasma_store: invalid option -- '-' > {code}
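The behavior being requested — `-h`/`--help` printing a proper usage summary instead of an error — is what standard option parsers provide out of the box. The actual fix belongs in the C++ `plasma_store` binary; as a sketch of the desired interface, here is a Python `argparse` analogy (only the `-s` switch comes from the report; the other option names and help strings are invented for illustration):

```python
import argparse

# Illustrative sketch of a discoverable plasma_store-like CLI.
parser = argparse.ArgumentParser(
    prog="plasma_store",
    description="Shared-memory object store (illustrative sketch).",
)
parser.add_argument("-s", "--socket", required=True,
                    help="socket path for incoming connections")
parser.add_argument("-m", "--memory", type=int, default=0,
                    help="memory limit in bytes")

# argparse generates -h/--help automatically, so `plasma_store --help`
# prints a usage summary instead of "invalid option -- '-'".
args = parser.parse_args(["-s", "/tmp/plasma"])
print(args.socket)
```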
[jira] [Updated] (ARROW-8574) [Rust] Implement Debug for all plain types
[ https://issues.apache.org/jira/browse/ARROW-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahmut Bulut updated ARROW-8574: Description: Currently, no plain type (like RecordBatch) or any array implementation implements Debug, so peeking into columns and looking at metadata is quite cumbersome. (was: Currently, no plain type (like RecordBatch) or any array implementation implements Debug, so peeking into columns and looking at metadata is quite cumbersome. If no objection arises, I would like to implement Debug for major plain structs around.) > [Rust] Implement Debug for all plain types > --- > > Key: ARROW-8574 > URL: https://issues.apache.org/jira/browse/ARROW-8574 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mahmut Bulut >Assignee: Mahmut Bulut >Priority: Major > > Currently, no plain type (like RecordBatch) or any array implementation > implements Debug, so peeking into columns and looking at metadata is quite > cumbersome.
[jira] [Created] (ARROW-8574) [Rust] Implement Debug for all plain types
Mahmut Bulut created ARROW-8574: --- Summary: [Rust] Implement Debug for all plain types Key: ARROW-8574 URL: https://issues.apache.org/jira/browse/ARROW-8574 Project: Apache Arrow Issue Type: Improvement Reporter: Mahmut Bulut Assignee: Mahmut Bulut Currently, no plain type (like RecordBatch) or any array implementation implements Debug, so peeking into columns and looking at metadata is quite cumbersome. If no objection arises, I would like to implement Debug for major plain structs around.
[jira] [Updated] (ARROW-8574) [Rust] Implement Debug for all plain types
[ https://issues.apache.org/jira/browse/ARROW-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahmut Bulut updated ARROW-8574: Component/s: Rust > [Rust] Implement Debug for all plain types > --- > > Key: ARROW-8574 > URL: https://issues.apache.org/jira/browse/ARROW-8574 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mahmut Bulut >Assignee: Mahmut Bulut >Priority: Major > > Currently, no type plain type (like RecordBatch) or any array implementation > implements debug. So peeking into columns and looking to metadata quite a bit > cumbersome. > > If no objection arises, I would like to implement Debug for major plain > structs around. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8574) [Rust] Implement Debug for all plain types
[ https://issues.apache.org/jira/browse/ARROW-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090865#comment-17090865 ] Mahmut Bulut commented on ARROW-8574: - If no objection arises, I would like to implement Debug for major plain structs around. > [Rust] Implement Debug for all plain types > --- > > Key: ARROW-8574 > URL: https://issues.apache.org/jira/browse/ARROW-8574 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Mahmut Bulut >Assignee: Mahmut Bulut >Priority: Major > > Currently, no type plain type (like RecordBatch) or any array implementation > implements debug. So peeking into columns and looking to metadata quite a bit > cumbersome. -- This message was sent by Atlassian Jira (v8.3.4#803005)
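The ARROW-8574 request is Rust-specific (deriving or hand-writing `std::fmt::Debug` for `RecordBatch` and the array types), but the underlying idea is the same as Python's `__repr__`: give every container a readable dump so you can peek into columns without manual field access. A toy sketch, with a hypothetical `ToyRecordBatch` standing in for the real struct:

```python
class ToyRecordBatch:
    """Minimal stand-in for a RecordBatch: named columns of equal length."""
    def __init__(self, schema, columns):
        self.schema = list(schema)      # column names
        self.columns = list(columns)    # one list of values per name

    def __repr__(self):
        # The kind of output a Debug/repr impl makes cheap to get:
        # row count plus a preview of each column.
        cols = ", ".join(
            f"{name}: {col[:3]}{'...' if len(col) > 3 else ''}"
            for name, col in zip(self.schema, self.columns))
        return f"ToyRecordBatch(rows={len(self.columns[0])}, {cols})"

batch = ToyRecordBatch(["id", "score"], [[1, 2, 3, 4], [0.5, 0.9, 0.1, 0.7]])
print(repr(batch))
```

In Rust the equivalent is `#[derive(Debug)]` where all fields are themselves `Debug`, or a manual `impl fmt::Debug` when a truncated preview like the one above is preferable to dumping every buffer.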
[jira] [Updated] (ARROW-8573) [Rust] Upgrade to Rust 1.44 nightly
[ https://issues.apache.org/jira/browse/ARROW-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8573: -- Labels: pull-request-available (was: ) > [Rust] Upgrade to Rust 1.44 nightly > --- > > Key: ARROW-8573 > URL: https://issues.apache.org/jira/browse/ARROW-8573 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Rust 1.43.0 was just released, so we should update to 1.44 nightly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8573) [Rust] Upgrade to Rust 1.44 nightly
Andy Grove created ARROW-8573: - Summary: [Rust] Upgrade to Rust 1.44 nightly Key: ARROW-8573 URL: https://issues.apache.org/jira/browse/ARROW-8573 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 1.0.0 Rust 1.43.0 was just released, so we should update to 1.44 nightly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8559) [Rust] Consolidate Record Batch iterator traits in main arrow crate
[ https://issues.apache.org/jira/browse/ARROW-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090863#comment-17090863 ] Mahmut Bulut commented on ARROW-8559: - This looks good. I am perfectly ok with this. > [Rust] Consolidate Record Batch iterator traits in main arrow crate > --- > > Key: ARROW-8559 > URL: https://issues.apache.org/jira/browse/ARROW-8559 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > > We have the `BatchIterator` trait in DataFusion and the `RecordBatchReader` > trait in the main arrow crate. > They differ in that `BatchIterator` is Send + Sync. They should both be in > the Arrow crate and be named `BatchIterator` and `SendableBatchIterator` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8566) [R] error when writing POSIXct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090861#comment-17090861 ] Curt Bergmann commented on ARROW-8566: -- Assigning the timezone solved the problem. Even though Sys.time() when printed shows a time zone, apparently the tzone attribute is not set. When I set it then I have success writing to the file. The column type that gets created also comes back as a posixct. Following is a run that shows the failure followed by success. To avoid re-showing the long java trace I just print "Failed" for when it fails. Thank you! library(DBI) library(sparklyr) library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp sc <- spark_connect(master = "local") sparklyr::spark_version(sc) #> [1] '2.4.4' Sys.timezone() #> [1] "America/Chicago" x <- data.frame(y = Sys.time()) x$y #> [1] "2020-04-23 14:17:18 CDT" lubridate::tz(x$y) #> [1] "" tryCatch(dbWriteTable(sc, "test_posixct", x), error = function(e) print("Failed")) #> [1] "Failed" attr(x$y, "tzone") <- Sys.timezone() x$y #> [1] "2020-04-23 14:17:18 CDT" lubridate::tz(x$y) #> [1] "America/Chicago" dbWriteTable(sc, "test_posixct", x) result_df <- dbReadTable(sc, "test_posixct") lubridate::tz(x$y) #> [1] "America/Chicago" > [R] error when writing POSIXct to spark > --- > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Major > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = 
"local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. > #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[jira] [Created] (ARROW-8572) [Python] Expose UnionArray.array and other fields
David Li created ARROW-8572: --- Summary: [Python] Expose UnionArray.array and other fields Key: ARROW-8572 URL: https://issues.apache.org/jira/browse/ARROW-8572 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.17.0 Reporter: David Li Assignee: David Li Currently in Python, you can construct a UnionArray easily, but getting the data back out (without copying) is near-impossible. We should expose the getter for UnionArray.array so we can pull out the constituent arrays. We should also expose fields like mode while we're at it. The use case is: in Flight, we'd like to write multiple distinct datasets (with distinct schemas) in a single logical call; using UnionArrays lets us combine these datasets into a single logical dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
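ARROW-8572 asks to expose in Python getters that already exist in C++. As background on why those getters matter, here is a sketch of the dense union layout: per-slot type ids select a child array, and offsets index into that child. The class and field names below are illustrative, not pyarrow's actual API.

```python
class ToyDenseUnion:
    """Dense union sketch: type_ids pick a child array, offsets index into it."""
    def __init__(self, type_ids, offsets, children):
        self.type_ids = type_ids    # which child each slot comes from
        self.offsets = offsets      # position within that child
        self.children = children    # the constituent arrays

    def value(self, i):
        return self.children[self.type_ids[i]][self.offsets[i]]

# Child 0 holds ints, child 1 holds strings; slots interleave the two.
u = ToyDenseUnion(
    type_ids=[0, 1, 0, 1],
    offsets=[0, 0, 1, 1],
    children=[[7, 8], ["a", "b"]],
)
print([u.value(i) for i in range(4)])  # [7, 'a', 8, 'b']
```

Exposing `children` directly, as the ticket proposes for `UnionArray`, is what makes zero-copy extraction possible: the caller takes a reference to an existing child array instead of rebuilding it slot by slot.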
[jira] [Resolved] (ARROW-8508) [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets
[ https://issues.apache.org/jira/browse/ARROW-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8508. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7006 [https://github.com/apache/arrow/pull/7006] > [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets > > > Key: ARROW-8508 > URL: https://issues.apache.org/jira/browse/ARROW-8508 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.16.0 >Reporter: Christian Beilschmidt >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > I created an example of storing multi points with Arrow. > # A coordinate consists of two floats (Float64Builder) > # A multi point consists of one or more coordinates (FixedSizeListBuilder) > # A list of multi points consists of multiple multi points (ListBuilder) > This is the corresponding code snippet: > {code:java} > let float_builder = arrow::array::Float64Builder::new(0); > let coordinate_builder = > arrow::array::FixedSizeListBuilder::new(float_builder, 2); > let mut multi_point_builder = > arrow::array::ListBuilder::new(coordinate_builder); > multi_point_builder > .values() > .values() > .append_slice(&[0.0, 0.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > multi_point_builder > .values() > .values() > .append_slice(&[1.0, 1.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > multi_point_builder.append(true).unwrap(); // first multi point > multi_point_builder > .values() > .values() > .append_slice(&[2.0, 2.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > multi_point_builder > .values() > .values() > .append_slice(&[3.0, 3.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > multi_point_builder > .values() > .values() > .append_slice(&[4.0, 4.1]) > .unwrap(); > multi_point_builder.values().append(true).unwrap(); > 
multi_point_builder.append(true).unwrap(); // second multi point > let multi_point = dbg!(multi_point_builder.finish()); > let first_multi_point_ref = multi_point.value(0); > let first_multi_point: &arrow::array::FixedSizeListArray = > first_multi_point_ref.as_any().downcast_ref().unwrap(); > let coordinates_ref = first_multi_point.values(); > let coordinates: &Float64Array = > coordinates_ref.as_any().downcast_ref().unwrap(); > assert_eq!(coordinates.value_slice(0, 2 * 2), &[0.0, 0.1, 1.0, 1.1]); > let second_multi_point_ref = multi_point.value(1); > let second_multi_point: &arrow::array::FixedSizeListArray = > second_multi_point_ref.as_any().downcast_ref().unwrap(); > let coordinates_ref = second_multi_point.values(); > let coordinates: &Float64Array = > coordinates_ref.as_any().downcast_ref().unwrap(); > assert_eq!(coordinates.value_slice(0, 2 * 3), &[2.0, 2.1, 3.0, 3.1, 4.0, > 4.1]); > {code} > The second assertion fails and the output is {{[0.0, 0.1, 1.0, 1.1, 2.0, > 2.1]}}. > Moreover, the debug output produced from {{dbg!}} confirms this: > {noformat} > [ > FixedSizeListArray<2> > [ > PrimitiveArray > [ > 0.0, > 0.1, > ], > PrimitiveArray > [ > 1.0, > 1.1, > ], > ], > FixedSizeListArray<2> > [ > PrimitiveArray > [ > 0.0, > 0.1, > ], > PrimitiveArray > [ > 1.0, > 1.1, > ], > PrimitiveArray > [ > 2.0, > 2.1, > ], > ], > ]{noformat} > The second list should contain the values 2-4. > > So either I am using the builder wrong or there is a bug with the offsets. I > used {{0.16}} as well as the current {{master}} from GitHub. -- This message was sent by Atlassian Jira (v8.3.4#803005)
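The ARROW-8508 report (resolved by pull request 7006) comes down to offset bookkeeping: the outer `ListBuilder` must record offsets in units of child slots (coordinates), so the second multi point starts at coordinate 2, not 0. A sketch of the intended layout, with plain Python lists standing in for the Arrow buffers:

```python
# Flat storage for List<FixedSizeList<Float64, 2>>:
FIXED_SIZE = 2                                   # floats per coordinate
flat = [0.0, 0.1, 1.0, 1.1,                      # multi point 0: 2 coords
        2.0, 2.1, 3.0, 3.1, 4.0, 4.1]            # multi point 1: 3 coords
list_offsets = [0, 2, 5]                         # in coordinates, not floats

def multi_point(i):
    """Slice out multi point i; offsets count coordinates, so scale by 2."""
    start, end = list_offsets[i], list_offsets[i + 1]
    return flat[start * FIXED_SIZE : end * FIXED_SIZE]

print(multi_point(0))  # [0.0, 0.1, 1.0, 1.1]
print(multi_point(1))  # [2.0, 2.1, 3.0, 3.1, 4.0, 4.1]
```

The buggy output in the report, where the second multi point re-reads coordinates starting from 0, is roughly what you get if the second offset pair fails to advance past the first list's coordinates.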
[jira] [Updated] (ARROW-8566) [R] error when writing POSIXct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8566: --- Summary: [R] error when writing POSIXct to spark (was: Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark) > [R] error when writing POSIXct to spark > --- > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Major > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. 
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:14) > #> at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > #> at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) > #> at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
[jira] [Updated] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8566: --- Priority: Major (was: Blocker) > Upgraded from r package arrow 16 to r package arrow 17 and now get an error > when writing posixct to spark > - > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Major > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. 
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:14) > #> at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > #> at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) > #> at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360
[jira] [Resolved] (ARROW-8569) [CI] Upgrade xcode version for testing homebrew formulae
[ https://issues.apache.org/jira/browse/ARROW-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8569. Resolution: Fixed Issue resolved by pull request 7019 [https://github.com/apache/arrow/pull/7019] > [CI] Upgrade xcode version for testing homebrew formulae > > > Key: ARROW-8569 > URL: https://issues.apache.org/jira/browse/ARROW-8569 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Packaging >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > To prevent as many bottles from being built from source. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-8065. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7000 [https://github.com/apache/arrow/pull/7000] > [C++][Dataset] Untangle Dataset, Fragment and ScanOptions > - > > Key: ARROW-8065 > URL: https://issues.apache.org/jira/browse/ARROW-8065 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Currently: a fragment is a product of a scan; it is a lazy collection of scan > tasks corresponding to a data source which is logically singular (like a > single file, a single row group, ...). It would be more useful if instead a > fragment were the direct object of a scan; one scans a fragment (or a > collection of fragments): > # Remove {{ScanOptions}} from Fragment's properties and move it into > {{Fragment::Scan}} parameters. > # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an > overload to support predicate pushdown in FileSystemDataset and UnionDataset > {{Dataset::GetFragments(std::shared_ptr predicate)}}. > # Expose lazy accessor to Fragment::physical_schema() > # Consolidate ScanOptions and ScanContext > This will lessen the cognitive dissonance between fragments and files since > fragments will no longer include references to scan properties. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?
[ https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090811#comment-17090811 ] Scott Wilson commented on ARROW-8199: - Ah. That will be very cool. Thanks for your feedback. I’ll continue with this approach, we’re moving our ML pipeline from python to c++, until yours materializes. On Thu, Apr 23, 2020 at 10:47 AM Wes McKinney (Jira) -- Sent from Gmail Mobile > [C++] Guidance for creating multi-column sort on Table example? > --- > > Key: ARROW-8199 > URL: https://issues.apache.org/jira/browse/ARROW-8199 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.16.0 >Reporter: Scott Wilson >Priority: Minor > Labels: c++, newbie > Attachments: ArrowCsv.cpp > > > I'm just coming up to speed with Arrow and am noticing a dearth of examples > ... maybe I can help here. > I'd like to implement multi-column sorting for Tables and just want to ensure > that I'm not duplicating existing work or proposing a bad design. > My thought was to create a Table-specific version of SortToIndices() where > you can specify the columns and sort order. > Then I'd create Array "views" that use the Indices to remap from the original > Array values to the values in sorted order. (Original data is not sorted, but > could be as a second step.) I noticed some of the array list variants keep > offsets, but didn't see anything that supports remapping per a list of > indices, but this may just be my oversight? > Thanks in advance, Scott -- This message was sent by Atlassian Jira (v8.3.4#803005)
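The proposal in this thread, a Table-level `SortToIndices` plus index-remapping views, can be sketched independently of Arrow: compute one permutation from several key columns, then "take" every column through it. The column names and the `(name, ascending)` sort-spec shape below are invented for illustration, not a proposed API.

```python
def sort_to_indices(columns, sort_keys):
    """Return a permutation sorting row indices by multiple columns.

    columns: dict of name -> list of values (all equal length)
    sort_keys: list of (name, ascending) pairs, highest priority first
    """
    n = len(next(iter(columns.values())))
    indices = list(range(n))
    # Stable sort: apply keys from lowest to highest priority.
    for name, ascending in reversed(sort_keys):
        indices.sort(key=lambda i: columns[name][i], reverse=not ascending)
    return indices

def take(column, indices):
    """Materialize a column in sorted order (the 'view' remap step)."""
    return [column[i] for i in indices]

cols = {"city": ["b", "a", "b", "a"], "pop": [3, 1, 2, 4]}
idx = sort_to_indices(cols, [("city", True), ("pop", False)])
print(take(cols["city"], idx), take(cols["pop"], idx))
# ['a', 'a', 'b', 'b'] [4, 1, 3, 2]
```

Because only `indices` is built, the original columns stay unsorted; a lazy view can remap on access, and materializing via `take` is an optional second step, matching the two-phase design described in the ticket.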
[jira] [Closed] (ARROW-8570) [CI] [C++] Link failure with AWS SDK on AppVeyor (Windows)
[ https://issues.apache.org/jira/browse/ARROW-8570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-8570. - Resolution: Duplicate > [CI] [C++] Link failure with AWS SDK on AppVeyor (Windows) > -- > > Key: ARROW-8570 > URL: https://issues.apache.org/jira/browse/ARROW-8570 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Blocker > > See e.g. > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/32391335/job/ptbl9h9fffu0s5he > {code} >Creating library release\arrow_flight.lib and object > release\arrow_flight.exp > absl_str_format_internal.lib(float_conversion.cc.obj) : error LNK2019: > unresolved external symbol __std_reverse_trivially_swappable_1 referenced in > function "void __cdecl std::_Reverse_unchecked1(char * const,char * > const,struct std::integral_constant)" > (??$_Reverse_unchecked1@PEAD@std@@YAXQEAD0U?$integral_constant@_K$00@0@@Z) > absl_strings.lib(charconv_bigint.cc.obj) : error LNK2001: unresolved external > symbol __std_reverse_trivially_swappable_1 > release\arrow_flight.dll : fatal error LNK1120: 1 unresolved externals > {code} > This is probably an issue with a conda-forge package: > https://github.com/conda-forge/grpc-cpp-feedstock/issues/58 > In the meantime we could pin {{grpc-cpp}} on your CI configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
[ https://issues.apache.org/jira/browse/ARROW-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-8571. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7023 [https://github.com/apache/arrow/pull/7023] > [C++] Switch AppVeyor image to VS 2017 > -- > > Key: ARROW-8571 > URL: https://issues.apache.org/jira/browse/ARROW-8571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > conda-forge did the switch, so we should follow this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?
[ https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090795#comment-17090795 ] Wes McKinney commented on ARROW-8199: - > Mainly I'd like to know if this looks like the direction you're thinking for > arrow::DataFrame? No, to be honest from a glance it's a different direction from what I've been thinking. My thoughts there actually are for the data frame internally to be a mix of yet-to-be-scanned Datasets (e.g. from CSV or Parquet files), manifest (materialized in-memory) chunked arrays, and unevaluated expressions. Analytics requests are translated into physical query plans to be executed by the to-be-developed query engine. I haven't been able to give this my full attention since writing the design docs last year but I intend to spend a large fraction of my time on it the rest of the year. The reasoning for wanting to push data frame operations into a query engine is to get around the memory use issues and performance problems associated with "eager evaluation" data frame libraries like pandas (for example, a join in pandas materializes the entire joined data frame in memory). There are similar issues around sorting (particularly with the knowledge of what you want to do with the sorted data -- e.g. sort followed by a slice can be executed as a Top-K operation for substantially less memory use). That said, I know a number of people have expressed interest in having STL interface layers in Arrow to the data structures. This would be a valuable thing to contribute to the project. It's not mutually exclusive with the stuff I wrote above but wanted to give some idea of my thinking. > [C++] Guidance for creating multi-column sort on Table example? 
> --- > > Key: ARROW-8199 > URL: https://issues.apache.org/jira/browse/ARROW-8199 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.16.0 >Reporter: Scott Wilson >Priority: Minor > Labels: c++, newbie > Attachments: ArrowCsv.cpp > > > I'm just coming up to speed with Arrow and am noticing a dearth of examples > ... maybe I can help here. > I'd like to implement multi-column sorting for Tables and just want to ensure > that I'm not duplicating existing work or proposing a bad design. > My thought was to create a Table-specific version of SortToIndices() where > you can specify the columns and sort order. > Then I'd create Array "views" that use the Indices to remap from the original > Array values to the values in sorted order. (Original data is not sorted, but > could be as a second step.) I noticed some of the array list variants keep > offsets, but didn't see anything that supports remapping per a list of > indices, but this may just be my oversight? > Thanks in advance, Scott -- This message was sent by Atlassian Jira (v8.3.4#803005)
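Wes's point that "sort followed by a slice can be executed as a Top-K operation for substantially less memory use" can be illustrated with a bounded heap: only k elements are ever retained, versus materializing a full sorted copy of the input.

```python
import heapq

def top_k(values, k):
    """Smallest k values in sorted order, keeping at most k items at once.

    heapq.nsmallest maintains a bounded heap internally, so extra memory
    is O(k) rather than the O(n) of sorted(values)[:k].
    """
    return heapq.nsmallest(k, values)

data = [9, 1, 7, 3, 8, 2, 6]
print(top_k(data, 3))  # [1, 2, 3] -- same result as sorted(data)[:3]
```

A query engine that sees the sort and the slice together can make this substitution automatically, which is exactly the kind of optimization an eager-evaluation library cannot perform.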
[jira] [Updated] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?
[ https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Scott Wilson updated ARROW-8199: Attachment: ArrowCsv.cpp Hi Wes, I hope you and yours are staying healthy in this strange new world! I've taken a stab at creating a DataFrame-like cover for arrow::Table. My first milestone was to see if I could come up with a df.eval()-like representation for single-line transforms -- see the EVAL2 macro. Attached is my code; I'm not quite sure where, if anywhere, I should post it to get your thoughts, so I'm sending this email. (I posted an earlier version on Jira ARROW-602.) Mainly I'd like to know if this looks like the direction you're thinking for arrow::DataFrame? Thanks, Scott

Code, also included as attachment:

// [16 #include directives; the header names were stripped by the mailing-list archive]

using namespace std;
using namespace arrow;

// SBW 2020.04.15 For ArrayCoverRaw::iterator, we can simply use the pointer interface.
// Wes suggests returning std::optional, but sizeof(double) < sizeof(std::optional<double>) and
// it is not a drop-in replacement for T, i.e. optional can't be used in expressions, need optional.value().

// STL container-like cover for arrow::Array.
// Only works for Array types that support raw_values().
template <typename ArrType>
class ArrayCoverRaw {
public:
  using T = typename ArrType::value_type;
  using pointer = T*;
  using const_pointer = const T*;
  using reference = T&;
  using const_reference = const T&;
  // Match size_type to Array offsets rather than using size_t and ptrdiff_t.
  using size_type = int64_t;
  using difference_type = int64_t;
  using iterator = pointer;
  using const_iterator = const_pointer;
  using reverse_iterator = pointer;
  using const_reverse_iterator = const_pointer;

  ArrayCoverRaw(std::shared_ptr<ArrType>& array) : _array(array) {}

  size_type size() const { return _array->length(); }

  // Should non-const versions fail if Array is immutable?
  iterator begin() { return const_cast<pointer>(_array->raw_values()); }
  iterator end() { return const_cast<pointer>(_array->raw_values() + _array->length()); }
  reverse_iterator rbegin() { return const_cast<pointer>(_array->raw_values() + _array->length() - 1); }
  reverse_iterator rend() { return const_cast<pointer>(_array->raw_values() - 1); }
  const_iterator cbegin() const { return _array->raw_values(); }
  const_iterator cend() const { return _array->raw_values() + _array->length(); }
  const_reverse_iterator crbegin() const { return _array->raw_values() + _array->length() - 1; }
  const_reverse_iterator crend() const { return _array->raw_values() - 1; }

  // We could return std::optional to encapsulate IsNull() info, but this would seem to break the expected semantics.
  reference operator[](const difference_type off) {
    assert(_array->data()->is_mutable());
    return *(const_cast<pointer>(_array->raw_values()) + off);
  }
  const_reference operator[](const difference_type off) const { return *(_array->raw_values() + off); }

  // ISSUE: is there an interface for setting IsNull() if the array is mutable?
  bool IsNull(difference_type off) const { return _array->IsNull(off); }

protected:
  std::shared_ptr<ArrType> _array;
};

// TODO: Add ArrayCoverString and iterators, perhaps others.

// Use template on RefType so we can create iterator and const_iterator by changing Value.
// Use class specializations to support Arrays that don't have raw_values().
template <typename CType, typename RefType>
class ChunkedArrayIterator
    : public boost::iterator_facade<ChunkedArrayIterator<CType, RefType>, RefType,
                                    boost::random_access_traversal_tag> {
public:
  using difference_type = int64_t;
  using T = CType;
  using ArrayType = typename CTypeTraits<CType>::ArrayType;
  using pointer = T*;

  explicit ChunkedArrayIterator(std::shared_ptr<ChunkedArray> ch_arr = nullptr, difference_type pos = 0)
      : _ch_arr(ch_arr) { set_position(pos); }

  bool IsNull() const {
    auto arr = _ch_arr->chunk(_chunk_index);
    return arr->IsNull(_current - _first);
  }

private:
  friend class boost::iterator_core_access;

  bool equal(ChunkedArrayIterator const& other) const { return this->_position == other._position; }

  void increment() {
    _position++;
    // Need to move to next chunk?
    if ((_current == _last) && ((_chunk_index + 1) < _ch_arr->num_chunks())) {
      _chunk_index++;
      auto arr = _ch_arr->chunk(_chunk_index);
      auto typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _first = const_cast<pointer>(typed_arr->raw_values());
      _last = _first + arr->length() - 1;
      _current = _first;
    } else {
      _current++;
    }
  }

  void decrement() {
    _position--;
    // Need to move to previous chunk?
    if ((_current == _first) && (_chunk_index > 0)) {
      _chunk_index--;
      auto arr = _ch_arr->chunk(_chunk_index);
      auto typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _first = const_cast<pointer>(typed_arr->raw_values());
      _last = _first + arr->length() - 1;
      _current = _last;
    } else {
      _current--;
    }
  }

  RefType& dereference() const { return *_current; }

  void advance(difference_type n) {
    _position += n;
    while (n > 0) {
      difference_type max_delta = _last - _current;
      if ((max_delta >= n) || ((_chunk_index +
// [remainder of the message truncated in the archive]
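The trickiest part of the attached ChunkedArrayIterator is the bookkeeping that moves _chunk_index/_current across chunk boundaries. The invariant it maintains is a mapping from a global position to a (chunk, offset) pair, which can be sketched in plain Python (lists standing in for Arrow chunks; not the Arrow API):

```python
def locate(chunks, position):
    # Map a global element position in a chunked sequence to
    # (chunk_index, offset_within_chunk) -- the invariant that
    # increment()/decrement()/advance() maintain incrementally.
    for chunk_index, chunk in enumerate(chunks):
        if position < len(chunk):
            return chunk_index, position
        position -= len(chunk)
    raise IndexError("position out of range")

chunks = [[10, 11], [12], [13, 14, 15]]
# Global positions 0..5 walk through all three chunks in order.
flat = [chunks[c][o] for c, o in (locate(chunks, p) for p in range(6))]
assert flat == [10, 11, 12, 13, 14, 15]
assert locate(chunks, 2) == (1, 0)  # first element of the second chunk
```

The iterator avoids recomputing this scan on every step by caching the current chunk's raw_values() pointer and only hopping chunks when _current hits _first or _last.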
[jira] [Updated] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
[ https://issues.apache.org/jira/browse/ARROW-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8571: -- Labels: pull-request-available (was: ) > [C++] Switch AppVeyor image to VS 2017 > -- > > Key: ARROW-8571 > URL: https://issues.apache.org/jira/browse/ARROW-8571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > conda-forge did the switch, so we should follow this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
[ https://issues.apache.org/jira/browse/ARROW-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn updated ARROW-8571: Description: conda-forge did the switch, so we should follow this. > [C++] Switch AppVeyor image to VS 2017 > -- > > Key: ARROW-8571 > URL: https://issues.apache.org/jira/browse/ARROW-8571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > > conda-forge did the switch, so we should follow this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
Uwe Korn created ARROW-8571: --- Summary: [C++] Switch AppVeyor image to VS 2017 Key: ARROW-8571 URL: https://issues.apache.org/jira/browse/ARROW-8571 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090772#comment-17090772 ] Neal Richardson commented on ARROW-8566: Hmm. Unfortunately, {{java.lang.UnsupportedOperationException}} doesn't tell me anything about what is unsupported. The only thing about posixt types that changed in the last {{arrow}} release was a fix for ARROW-3543, specifically https://github.com/apache/arrow/commit/507762fa51d17e61f08d36d3626ab8b8df716198. I wonder, does it work if you explicitly set {{tz="GMT"}} on a POSIXct and send that? > Upgraded from r package arrow 16 to r package arrow 17 and now get an error > when writing posixct to spark > - > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Blocker > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. 
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler
[jira] [Created] (ARROW-8570) [CI] [C++] Link failure with AWS SDK on AppVeyor (Windows)
Antoine Pitrou created ARROW-8570: - Summary: [CI] [C++] Link failure with AWS SDK on AppVeyor (Windows) Key: ARROW-8570 URL: https://issues.apache.org/jira/browse/ARROW-8570 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Antoine Pitrou See e.g. https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/32391335/job/ptbl9h9fffu0s5he {code} Creating library release\arrow_flight.lib and object release\arrow_flight.exp absl_str_format_internal.lib(float_conversion.cc.obj) : error LNK2019: unresolved external symbol __std_reverse_trivially_swappable_1 referenced in function "void __cdecl std::_Reverse_unchecked1(char * const,char * const,struct std::integral_constant)" (??$_Reverse_unchecked1@PEAD@std@@YAXQEAD0U?$integral_constant@_K$00@0@@Z) absl_strings.lib(charconv_bigint.cc.obj) : error LNK2001: unresolved external symbol __std_reverse_trivially_swappable_1 release\arrow_flight.dll : fatal error LNK1120: 1 unresolved externals {code} This is probably an issue with a conda-forge package: https://github.com/conda-forge/grpc-cpp-feedstock/issues/58 In the meantime we could pin {{grpc-cpp}} in our CI configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8562) [C++] IO: Parameterize I/O coalescing using S3 storage metrics
[ https://issues.apache.org/jira/browse/ARROW-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090742#comment-17090742 ] Mayur Srivastava commented on ARROW-8562: - [~apitrou], [~lidavidm], I've created a PR for this work: [https://github.com/apache/arrow/pull/7022] Please take a look when you get a chance. Thanks, Mayur > [C++] IO: Parameterize I/O coalescing using S3 storage metrics > -- > > Key: ARROW-8562 > URL: https://issues.apache.org/jira/browse/ARROW-8562 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Related to https://issues.apache.org/jira/browse/ARROW-7995 > The adaptive I/O coalescing algorithm uses two parameters: > 1. max_io_gap: Max I/O gap/hole size in bytes > 2. ideal_request_size: Ideal I/O request size in bytes > These parameters can be derived from S3 metrics as described below: > In an S3-compatible storage, there are two main metrics: > 1. Seek time or Time-To-First-Byte (TTFB) in seconds: call setup latency of > a new S3 request > 2. Transfer Bandwidth (BW) for data in bytes/sec > 1. Computing max_io_gap: > max_io_gap = TTFB * BW > This is also called the Bandwidth-Delay Product (BDP). > Two byte ranges that have a gap can still be mapped to the same read if the > gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. > if the Time-To-First-Byte (or call setup latency of a new S3 request) is > expected to be greater than just reading and discarding the extra bytes on an > existing HTTP request. > 2. Computing ideal_request_size: > We want to have high bandwidth utilization per S3 connection, i.e. transfer > large amounts of data to amortize the seek overhead. > But, we also want to leverage parallelism by slicing very large IO chunks. 
> We define two more config parameters with suggested default values to control > the slice size and to balance the two effects, with the goal of > maximizing net data load performance. > BW_util (ideal bandwidth utilization): > This is the fraction of per-connection bandwidth that should be utilized to > maximize net data load. > A good default value is 90% or 0.9. > MAX_IDEAL_REQUEST_SIZE: > This is the maximum single request size (in bytes) used to maximize > net data load. > A good default value is 64 MiB. > The amount of data that needs to be transferred in a single S3 get_object > request to achieve effective bandwidth eff_BW = BW_util * BW is as follows: > eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW) > Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the > following result: > ideal_request_size = max_io_gap * BW_util / (1 - BW_util) > Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: > ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - > BW_util)) > The proposal is to create a named constructor in io::CacheOptions (PR: > [https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) to > compute max_io_gap and ideal_request_size from TTFB and BW, which will then be > passed to the reader to configure the I/O coalescing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
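The closed-form result in the proposal is easy to sanity-check numerically; here is a small plain-Python sketch (the names mirror the description above, not the actual io::CacheOptions API):

```python
def coalescing_params(ttfb_s, bw_bytes_per_s, bw_util=0.9,
                      max_ideal_request_size=64 * 1024 * 1024):
    # max_io_gap: the bandwidth-delay product -- a hole this size is
    # cheaper to read through than to pay another request's TTFB.
    max_io_gap = ttfb_s * bw_bytes_per_s
    # Request size at which effective bandwidth reaches bw_util * BW,
    # capped at MAX_IDEAL_REQUEST_SIZE to preserve parallelism.
    ideal_request_size = min(max_ideal_request_size,
                             max_io_gap * bw_util / (1 - bw_util))
    return max_io_gap, ideal_request_size

# e.g. 50 ms TTFB at 100 MB/s with the suggested defaults:
gap, req = coalescing_params(0.05, 100_000_000)
assert round(gap) == 5_000_000    # coalesce across holes up to ~5 MB
assert round(req) == 45_000_000   # ~45 MB requests reach 90% utilization
```

Plugging req back into eff_BW = req / (TTFB + req / BW) gives 45 MB / 0.5 s = 90 MB/s, i.e. the target 0.9 * BW, which confirms the substitution step in the description.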
[jira] [Commented] (ARROW-8562) [C++] IO: Parameterize I/O coalescing using S3 storage metrics
[ https://issues.apache.org/jira/browse/ARROW-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090735#comment-17090735 ] Mayur Srivastava commented on ARROW-8562: - I'm going to send a PR soon > [C++] IO: Parameterize I/O coalescing using S3 storage metrics > -- > > Key: ARROW-8562 > URL: https://issues.apache.org/jira/browse/ARROW-8562 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Related to https://issues.apache.org/jira/browse/ARROW-7995 > The adaptive I/O coalescing algorithm uses two parameters: > 1. max_io_gap: Max I/O gap/hole size in bytes > 2. ideal_request_size = Ideal I/O Request size in bytes > These parameters can be derived from S3 metrics as described below: > In an S3 compatible storage, there are two main metrics: > 1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of > a new S3 request > 2. Transfer Bandwidth (BW) for data in bytes/sec > 1. Computing max_io_gap: > max_io_gap = TTFB * BW > This is also called Bandwidth-Delay-Product (BDP). > Two byte ranges that have a gap can still be mapped to the same read if the > gap is less than the bandwidt-delay product [TTFB * TransferBandwidth], i.e. > if the Time-To-First-Byte (or call setup latency of a new S3 request) is > expected to be greater than just reading and discarding the extra bytes on an > existing HTTP request. > 2. Computing ideal_request_size: > We want to have high bandwidth utilization per S3 connections, i.e. transfer > large amounts of data to amortize the seek overhead. > But, we also want to leverage parallelism by slicing very large IO chunks. > We define two more config parameters with suggested default values to control > the slice size and seek to balance the two effects with the goal of > maximizing net data load performance. 
> BW_util (ideal bandwidth utilization): > This means what fraction of per connection bandwidth should be utilized to > maximize net data load. > A good default value is 90% or 0.9. > MAX_IDEAL_REQUEST_SIZE: > This means what is the maximum single request size (in bytes) to maximize > net data load. > A good default value is 64 MiB. > The amount of data that needs to be transferred in a single S3 get_object > request to achieve effective bandwidth eff_BW = BW_util * BW is as follows: > eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW) > Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the > following result: > ideal_request_size = max_io_gap * BW_util / (1 - BW_util) > Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: > ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - > BW_util)) > The proposal is to create a named constructor in the io::CacheOptions (PR: > [https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) to > compute max_io_gap and ideal_request_size from TTFB and BW which will then be > passed to reader to configure the I/O coalescing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8562) [C++] IO: Parameterize I/O coalescing using S3 storage metrics
[ https://issues.apache.org/jira/browse/ARROW-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8562: -- Labels: pull-request-available (was: ) > [C++] IO: Parameterize I/O coalescing using S3 storage metrics > -- > > Key: ARROW-8562 > URL: https://issues.apache.org/jira/browse/ARROW-8562 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Related to https://issues.apache.org/jira/browse/ARROW-7995 > The adaptive I/O coalescing algorithm uses two parameters: > 1. max_io_gap: Max I/O gap/hole size in bytes > 2. ideal_request_size = Ideal I/O Request size in bytes > These parameters can be derived from S3 metrics as described below: > In an S3 compatible storage, there are two main metrics: > 1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of > a new S3 request > 2. Transfer Bandwidth (BW) for data in bytes/sec > 1. Computing max_io_gap: > max_io_gap = TTFB * BW > This is also called Bandwidth-Delay-Product (BDP). > Two byte ranges that have a gap can still be mapped to the same read if the > gap is less than the bandwidt-delay product [TTFB * TransferBandwidth], i.e. > if the Time-To-First-Byte (or call setup latency of a new S3 request) is > expected to be greater than just reading and discarding the extra bytes on an > existing HTTP request. > 2. Computing ideal_request_size: > We want to have high bandwidth utilization per S3 connections, i.e. transfer > large amounts of data to amortize the seek overhead. > But, we also want to leverage parallelism by slicing very large IO chunks. > We define two more config parameters with suggested default values to control > the slice size and seek to balance the two effects with the goal of > maximizing net data load performance. 
> BW_util (ideal bandwidth utilization): > This means what fraction of per connection bandwidth should be utilized to > maximize net data load. > A good default value is 90% or 0.9. > MAX_IDEAL_REQUEST_SIZE: > This means what is the maximum single request size (in bytes) to maximize > net data load. > A good default value is 64 MiB. > The amount of data that needs to be transferred in a single S3 get_object > request to achieve effective bandwidth eff_BW = BW_util * BW is as follows: > eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW) > Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the > following result: > ideal_request_size = max_io_gap * BW_util / (1 - BW_util) > Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: > ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - > BW_util)) > The proposal is to create a named constructor in the io::CacheOptions (PR: > [https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) to > compute max_io_gap and ideal_request_size from TTFB and BW which will then be > passed to reader to configure the I/O coalescing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090719#comment-17090719 ] Curt Bergmann commented on ARROW-8566: -- Yes, this fails every time. It is also reproduced on my colleague's machine. The failure is only with a posixct data type. This data type did not fail in version 16. It seems to be associated with this in the traceback: java.lang.UnsupportedOperationException at org.apache.spark.sql.vectorized.ArrowColumnVector.(ArrowColumnVector.java:173) > Upgraded from r package arrow 16 to r package arrow 17 and now get an error > when writing posixct to spark > - > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Blocker > > monospaced text}}``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted. 
> #> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > #> at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > #> at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) > #> at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > #> at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > #> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > #> at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) > #> at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > #> at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > #> at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > #> at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > #> at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) > #> at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) > #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > #> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > #> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > #> at java.lang.reflect.Method.invoke(Method.java:498) > #> at sparklyr.Invoke.invoke(invoke.scala:147) > #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) > #> at sparklyr.StreamHandler.read(stream.scala:61) > #> at > sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) > #> at scala.util.control.Breaks.breakable(Breaks.scala:38) > #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) > #> at sparklyr.BackendHandle
[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64
[ https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090700#comment-17090700 ] Antoine Pitrou commented on ARROW-3329: --- Thank you [~jacek.pliszka] for doing this. The PR unexpectedly uncovered two issues: ARROW-8567 and ARROW-8568. > [Python] Error casting decimal(38, 4) to int64 > -- > > Key: ARROW-3329 > URL: https://issues.apache.org/jira/browse/ARROW-3329 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Environment: Python version : 3.6.5 > Pyarrow version : 0.10.0 >Reporter: Kavita Sheth >Assignee: Jacek Pliszka >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8h 10m > Remaining Estimate: 0h > > GitHub issue link: https://github.com/apache/arrow/issues/2627 > I want to cast a pyarrow table column from decimal(38,4) to int64. > col.cast(pa.int64()) > Error: > File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast > File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status > pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, > 4) to int64 > Python version : 3.6.5 > Pyarrow version : 0.10.0 > Is it not implemented yet, or am I not using it correctly? If not implemented > yet, is there any workaround to cast columns? -- This message was sent by Atlassian Jira (v8.3.4#803005)
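For context, the semantics such a cast needs are small enough to sketch in plain Python with the decimal module (illustrative only -- this is not pyarrow's implementation): a decimal(38, 4) value fits in int64 only when its fractional part is zero and the integer part is in range.

```python
from decimal import Decimal

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def decimal_to_int64(value, allow_truncate=False):
    # Mirrors a "safe" decimal -> int64 cast: reject values that would
    # lose the fractional part or overflow a 64-bit integer.
    i = int(value)  # truncates toward zero
    if not allow_truncate and value != i:
        raise ValueError(f"cast would lose data: {value}")
    if not (INT64_MIN <= i <= INT64_MAX):
        raise OverflowError(f"value out of int64 range: {value}")
    return i

assert decimal_to_int64(Decimal("1234.0000")) == 1234
assert decimal_to_int64(Decimal("-7.5000"), allow_truncate=True) == -7
```

With 38 digits of precision, values can exceed the roughly 19 digits an int64 holds, so the overflow check is not optional even when the scale-4 fractional part is zero.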
[jira] [Resolved] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64
[ https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-3329. --- Resolution: Fixed Issue resolved by pull request 6846 [https://github.com/apache/arrow/pull/6846] > [Python] Error casting decimal(38, 4) to int64 > -- > > Key: ARROW-3329 > URL: https://issues.apache.org/jira/browse/ARROW-3329 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Environment: Python version : 3.6.5 > Pyarrow version : 0.10.0 >Reporter: Kavita Sheth >Assignee: Jacek Pliszka >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8h 10m > Remaining Estimate: 0h > > Git issue LInk : https://github.com/apache/arrow/issues/2627 > I want to cast pyarrow table column from decimal(38,4) to int64. > col.cast(pa.int64()) > Error: > File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast > File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status > pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, > 4) to int64 > Python version : 3.6.5 > Pyarrow version : 0.10.0 > is it not implemented yet or I am not using it correctly? If not implemented > yet, then any work around to cast columns? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8569) [CI] Upgrade xcode version for testing homebrew formulae
[ https://issues.apache.org/jira/browse/ARROW-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8569: -- Labels: pull-request-available (was: ) > [CI] Upgrade xcode version for testing homebrew formulae > > > Key: ARROW-8569 > URL: https://issues.apache.org/jira/browse/ARROW-8569 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Packaging >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > To prevent as many bottles as possible from being built from source. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8569) [CI] Upgrade xcode version for testing homebrew formulae
Neal Richardson created ARROW-8569: -- Summary: [CI] Upgrade xcode version for testing homebrew formulae Key: ARROW-8569 URL: https://issues.apache.org/jira/browse/ARROW-8569 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Packaging Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 1.0.0 To prevent as many bottles as possible from being built from source. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
[ https://issues.apache.org/jira/browse/ARROW-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090673#comment-17090673 ] Neal Richardson commented on ARROW-8566: Is this consistently reproducible? Do any other data types cause issues? I can't tell from the spark traceback what is failing exactly. > Upgraded from r package arrow 16 to r package arrow 17 and now get an error > when writing posixct to spark > - > > Key: ARROW-8566 > URL: https://issues.apache.org/jira/browse/ARROW-8566 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: #> R version 3.6.3 (2020-02-29) > #> Platform: x86_64-apple-darwin15.6.0 (64-bit) > #> Running under: macOS Mojave 10.14.6 > sparklyr::spark_version(sc) > #> [1] '2.4.5' >Reporter: Curt Bergmann >Priority: Blocker > > ``` r > library(DBI) > library(sparklyr) > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > sc <- spark_connect(master = "local") > sparklyr::spark_version(sc) > #> [1] '2.4.5' > x <- data.frame(y = Sys.time()) > dbWriteTable(sc, "test_posixct", x) > #> Error: org.apache.spark.SparkException: Job aborted.
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090664#comment-17090664 ] Remi Dettai commented on ARROW-8565: Yes, but I could not manage to make it work with the static build of arrow, and I don't want to use the shared lib of arrow as it generates a set of binaries that is too large for my use case (embedding into AWS Lambda with optimized cold start) > [C++] Static build with AWS SDK > --- > > Key: ARROW-8565 > URL: https://issues.apache.org/jira/browse/ARROW-8565 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.17.0 >Reporter: Remi Dettai >Priority: Major > Labels: aws-s3, build-problem > > I can't find my way around the build system when using the S3 client. > It seems that only the shared target is allowed when the S3 feature is ON. In the > thirdparty toolchain, when printing: > ??FATAL_ERROR "FIXME: Building AWS C++ SDK from source will link with wrong > libcrypto"?? > What is actually meant is that the static build will not work, correct? If that is > the case, should libarrow.a be generated at all when the S3 feature is on? > What can be done to fix this? What does it mean that the SDK links to the > wrong libcrypto? Is it fixable? Or is there a way to have the static build > but maintain a dynamic link to a shared version of the SDK? > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8560) [Rust] Docs for MutableBuffer resize are incorrect
[ https://issues.apache.org/jira/browse/ARROW-8560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8560. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7015 [https://github.com/apache/arrow/pull/7015] > [Rust] Docs for MutableBuffer resize are incorrect > -- > > Key: ARROW-8560 > URL: https://issues.apache.org/jira/browse/ARROW-8560 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8532) [C++][CSV] Add support for sentinel values.
[ https://issues.apache.org/jira/browse/ARROW-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090648#comment-17090648 ] Francois Saint-Jacques commented on ARROW-8532: --- Is this a duplicate of ARROW-8348? > [C++][CSV] Add support for sentinel values. > --- > > Key: ARROW-8532 > URL: https://issues.apache.org/jira/browse/ARROW-8532 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ravil Bikbulatov >Priority: Major > > Some systems still use sentinel values to store nulls. It would be good if > read_csv could place sentinel values so the user wouldn't need to convert > null bitmaps to sentinel values. > Adding this support doesn't contradict the Arrow specification, as null values are > undefined. It also wouldn't add any overhead to read_csv. Since Arrow is a > general-purpose framework, I think we can relieve users from the pain of > converting bitmaps to sentinel values. -- This message was sent by Atlassian Jira (v8.3.4#803005)
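[Editor's note: the post-read conversion the reporter wants to avoid can be sketched in plain Python. The helper name is hypothetical, for illustration only; it is user code, not a proposed Arrow API.]

```python
def nulls_to_sentinel(values, sentinel):
    """Replace nulls (represented here as None) with a sentinel value.

    This is the manual step users currently perform after read_csv when
    their downstream system expects sentinel-encoded nulls.
    """
    return [sentinel if v is None else v for v in values]

# A column read from CSV with missing entries:
# nulls_to_sentinel([1, None, 3], -9999) -> [1, -9999, 3]
```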
[jira] [Commented] (ARROW-8348) [C++] Support optional sentinel values in primitive Array for nulls
[ https://issues.apache.org/jira/browse/ARROW-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090647#comment-17090647 ] Francois Saint-Jacques commented on ARROW-8348: --- I'm not proposing this as a format change, just a C++ interface nicety. It could be done with Metadata, but string typing is a pain. > [C++] Support optional sentinel values in primitive Array for nulls > --- > > Key: ARROW-8348 > URL: https://issues.apache.org/jira/browse/ARROW-8348 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > > This is an optional feature where a sentinel value is stored in null cells > and is exposed via an accessor method, e.g. `optional > Array::HasSentinel() const;`. This would allow zero-copy bi-directional > conversion with R. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090637#comment-17090637 ] Francois Saint-Jacques commented on ARROW-8565: --- Have you tried with a shared build of aws sdk? > [C++] Static build with AWS SDK > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8568) [C++][Python] Crash on decimal cast in debug mode
Antoine Pitrou created ARROW-8568: - Summary: [C++][Python] Crash on decimal cast in debug mode Key: ARROW-8568 URL: https://issues.apache.org/jira/browse/ARROW-8568 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.17.0 Reporter: Antoine Pitrou
{code:python}
>>> arr = pa.array([Decimal('123.45')])
>>> arr
[ 123.45 ]
>>> arr.type
Decimal128Type(decimal(5, 2))
>>> arr.cast(pa.decimal128(4, 2))
../src/arrow/util/basic_decimal.cc:626: Check failed: (original_scale) != (new_scale)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
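[Editor's note: the cast in the report reduces precision, not scale (both types have scale 2), which is why the scale-rescale check fires. The underlying fit question can be sketched in stdlib Python; the helper below is a hypothetical illustration, not Arrow's actual implementation.]

```python
from decimal import Decimal

def fits(value: Decimal, precision: int, scale: int) -> bool:
    """Check whether `value` fits a decimal(precision, scale) type exactly."""
    sign, digits, exponent = value.as_tuple()
    if -exponent > scale:
        # more fractional digits than the target scale allows
        return False
    # total digits required once the value is rescaled to `scale`
    needed = len(digits) + (scale + exponent)
    return needed <= precision

# Decimal('123.45') needs 5 digits at scale 2, so it fits decimal(5, 2)
# but not decimal(4, 2), matching the failing cast above.
```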
[jira] [Updated] (ARROW-8563) [Go] Minor change to make newBuilder public
[ https://issues.apache.org/jira/browse/ARROW-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8563: Summary: [Go] Minor change to make newBuilder public (was: Minor change to make newBuilder public) > [Go] Minor change to make newBuilder public > --- > > Key: ARROW-8563 > URL: https://issues.apache.org/jira/browse/ARROW-8563 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Amol Umbarkar >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This minor change makes newBuilder() public to reduce verbosity for > downstream users. > To give you an example, I am working on parquet read / write into an Arrow Record > Batch, where the parquet data types are mapped to Arrow data types. > My Repo: [https://github.com/mindhash/arrow-parquet-go] > In such cases, it would be nice for the builder API (newBuilder) to be generic: > accept a data type and return the corresponding array builder. > I am looking at a similar situation for a JSON reader. I think this change will > make the builder API much easier to use for upstream as well as internal packages. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8532) [C++][CSV] Add support for sentinel values.
[ https://issues.apache.org/jira/browse/ARROW-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090615#comment-17090615 ] Wes McKinney commented on ARROW-8532: - If you want to propose a configuration option to place some non-zero value in the null slots, please feel free to do so. > [C++][CSV] Add support for sentinel values. > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090605#comment-17090605 ] Remi Dettai edited comment on ARROW-8565 at 4/23/20, 1:14 PM: -- That's what I am trying to do. I have a static build of the sdk that is correctly picked up (_${AWSSDK_LINK_LIBRARIES}_ seems defined) but when I try to link to it in an example with {{target_link_libraries(example PRIVATE parquet_static ${AWSSDK_LINK_LIBRARIES})}} I keep getting awful _undefined reference to `Aws::XXX`_ errors :( was (Author: rdettai): That what I am trying to do. I have a static build of the sdk that is correctly picked up (_${AWSSDK_LINK_LIBRARIES}_ seems defined) but when i try to link to it in an example with {{target_link_libraries(example PRIVATE parquet_static ${AWSSDK_LINK_LIBRARIES})}} I keep getting awefull _undefined reference to `Aws::XXX`_ errors :( > [C++] Static build with AWS SDK > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090605#comment-17090605 ] Remi Dettai commented on ARROW-8565: That's what I am trying to do. I have a static build of the sdk that is correctly picked up (_${AWSSDK_LINK_LIBRARIES}_ seems defined) but when I try to link to it in an example with {{target_link_libraries(example PRIVATE parquet_static ${AWSSDK_LINK_LIBRARIES})}} I keep getting awful _undefined reference to `Aws::XXX`_ errors :( > [C++] Static build with AWS SDK > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8567) [Python] pa.array() sometimes ignore "safe=False"
Antoine Pitrou created ARROW-8567: - Summary: [Python] pa.array() sometimes ignores "safe=False" Key: ARROW-8567 URL: https://issues.apache.org/jira/browse/ARROW-8567 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.0 Reporter: Antoine Pitrou Generally, {{pa.array(data).cast(sometype, safe=...)}} is equivalent to {{pa.array(data, sometype, safe=...)}}. Consider the following:
{code:python}
>>> pa.array([Decimal('12.34')]).cast(pa.int32(), safe=False)
[ 12 ]
>>> pa.array([Decimal('12.34')], pa.int32(), safe=False)
[ 12 ]
{code}
However, that is not always the case:
{code:python}
>>> pa.array([Decimal('1234')]).cast(pa.int8(), safe=False)
[ -46 ]
>>> pa.array([Decimal('1234')], pa.int8(), safe=False)
Traceback (most recent call last):
...
ArrowInvalid: Value 1234 too large to fit in C integer type
{code}
I don't think this is very important: first because you can call cast() directly, second because the results are unusable anyway. -- This message was sent by Atlassian Jira (v8.3.4#803005)
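[Editor's note: the [ -46 ] result above is consistent with ordinary two's-complement truncation of 1234 into 8 bits. A stdlib-only sketch of that wrap-around (illustration, not pyarrow code):]

```python
def truncate_to_int8(value: int) -> int:
    """Wrap an integer into the signed 8-bit range, as an unsafe
    narrowing cast does."""
    wrapped = value & 0xFF                       # keep the low 8 bits
    return wrapped - 256 if wrapped >= 128 else wrapped

# 1234 = 0x4D2; the low byte 0xD2 is 210 unsigned, i.e. -46 as a signed int8.
```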
[jira] [Assigned] (ARROW-8497) [Archery] Add missing component to builds
[ https://issues.apache.org/jira/browse/ARROW-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-8497: - Assignee: Francois Saint-Jacques > [Archery] Add missing component to builds > - > > Key: ARROW-8497 > URL: https://issues.apache.org/jira/browse/ARROW-8497 > Project: Apache Arrow > Issue Type: Improvement > Components: Archery, Developer Tools >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8497) [Archery] Add missing component to builds
[ https://issues.apache.org/jira/browse/ARROW-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8497. --- Resolution: Fixed Issue resolved by pull request 6966 [https://github.com/apache/arrow/pull/6966] > [Archery] Add missing component to builds > - > > Key: ARROW-8497 > URL: https://issues.apache.org/jira/browse/ARROW-8497 > Project: Apache Arrow > Issue Type: Improvement > Components: Archery, Developer Tools >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8566) Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark
Curt Bergmann created ARROW-8566: Summary: Upgraded from r package arrow 16 to r package arrow 17 and now get an error when writing posixct to spark Key: ARROW-8566 URL: https://issues.apache.org/jira/browse/ARROW-8566 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 0.17.0 Environment: #> R version 3.6.3 (2020-02-29) #> Platform: x86_64-apple-darwin15.6.0 (64-bit) #> Running under: macOS Mojave 10.14.6 sparklyr::spark_version(sc) #> [1] '2.4.5' Reporter: Curt Bergmann ``` r library(DBI) library(sparklyr) library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp sc <- spark_connect(master = "local") sparklyr::spark_version(sc) #> [1] '2.4.5' x <- data.frame(y = Sys.time()) dbWriteTable(sc, "test_posixct", x) #> Error: org.apache.spark.SparkException: Job aborted. #> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) #> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) #> at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:503) #> at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) #> at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) #> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) #> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) #> at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) #> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) #> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) #> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) #> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) #> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) #> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) #> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83) #> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) #> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) #> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) #> at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) #> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) #> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) #> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) #> at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474) #> at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453) #> at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409) #> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) #> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) #> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) #> at java.lang.reflect.Method.invoke(Method.java:498) #> at sparklyr.Invoke.invoke(invoke.scala:147) #> at sparklyr.StreamHandler.handleMethodCall(stream.scala:136) #> at sparklyr.StreamHandler.read(stream.scala:61) #> at sparklyr.BackendHandler$$anonfun$channelRead0$1.apply$mcV$sp(handler.scala:58) #> at scala.util.control.Breaks.breakable(Breaks.scala:38) #> at sparklyr.BackendHandler.channelRead0(handler.scala:38) #> 
at sparklyr.BackendHandler.channelRead0(handler.scala:14) #> at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) #> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) #> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) #> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) #> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) #> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) #> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) #> at io.netty.channel.AbstractChann
[jira] [Commented] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090527#comment-17090527 ] Neville Dipale commented on ARROW-8536: --- Done, please take a look > [Rust] Failed to locate format/Flight.proto in any parent directory > --- > > Key: ARROW-8536 > URL: https://issues.apache.org/jira/browse/ARROW-8536 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.17.0 >Reporter: Andy Grove >Assignee: Neville Dipale >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > When using Arrow 0.17.0 as a dependency, it is likely that you will get the > error "Failed to locate format/Flight.proto in any parent directory". This is > caused by the custom build script in the arrow-flight crate, which expects to > find a "format/Flight.proto" file in a parent directory. This works when > building the crate from within the Arrow source tree, but unfortunately > doesn't work for the published crate, since the Flight.proto file was not > published as part of the crate. > The workaround is to create a "format" directory in the root of your file > system (or at least at a higher level than where cargo is building code) and > place the Flight.proto file there (making sure to use the 0.17.0 version, > which can be found in the source release [1]). > [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
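[Editor's note: the workaround described above can be scripted. A minimal sketch in Python; the helper name and directory layout are placeholders for your own setup, not part of Arrow.]

```python
import shutil
from pathlib import Path

def install_flight_proto(source: Path, parent_dir: Path) -> Path:
    """Copy Flight.proto into <parent_dir>/format so the arrow-flight
    crate's build script can find it by walking up parent directories
    from wherever cargo builds the code."""
    fmt = parent_dir / "format"
    fmt.mkdir(parents=True, exist_ok=True)
    dest = fmt / "Flight.proto"
    shutil.copy(source, dest)
    return dest
```

Make sure `source` points at the 0.17.0 version of Flight.proto from the source release, and that `parent_dir` sits above the cargo build directory.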
[jira] [Assigned] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-8536: - Assignee: Neville Dipale (was: Andy Grove) > [Rust] Failed to locate format/Flight.proto in any parent directory > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory
[ https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8536: -- Labels: pull-request-available (was: ) > [Rust] Failed to locate format/Flight.proto in any parent directory > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8562) [C++] IO: Parameterize I/O coalescing using S3 storage metrics
[ https://issues.apache.org/jira/browse/ARROW-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090509#comment-17090509 ] Antoine Pitrou commented on ARROW-8562: --- > The proposal is to create a named constructor in the io::CacheOptions (PR: > https://github.com/apache/arrow/pull/6744 created by David Li) to compute > max_io_gap and ideal_request_size from TTFB and BW which will then be passed > to reader to configure the I/O coalescing. This sounds like a good idea in principle. Can you submit a PR? > [C++] IO: Parameterize I/O coalescing using S3 storage metrics > -- > > Key: ARROW-8562 > URL: https://issues.apache.org/jira/browse/ARROW-8562 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mayur Srivastava >Priority: Major > > Related to https://issues.apache.org/jira/browse/ARROW-7995 > The adaptive I/O coalescing algorithm uses two parameters: > 1. max_io_gap: Max I/O gap/hole size in bytes > 2. ideal_request_size = Ideal I/O Request size in bytes > These parameters can be derived from S3 metrics as described below: > In an S3 compatible storage, there are two main metrics: > 1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of > a new S3 request > 2. Transfer Bandwidth (BW) for data in bytes/sec > 1. Computing max_io_gap: > max_io_gap = TTFB * BW > This is also called Bandwidth-Delay-Product (BDP). > Two byte ranges that have a gap can still be mapped to the same read if the > gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. > if the Time-To-First-Byte (or call setup latency of a new S3 request) is > expected to be greater than just reading and discarding the extra bytes on an > existing HTTP request. > 2. Computing ideal_request_size: > We want to have high bandwidth utilization per S3 connection, i.e. transfer > large amounts of data to amortize the seek overhead. 
> But, we also want to leverage parallelism by slicing very large IO chunks. > We define two more config parameters with suggested default values to control > the slice size and seek to balance the two effects with the goal of > maximizing net data load performance. > BW_util (ideal bandwidth utilization): > This means what fraction of per connection bandwidth should be utilized to > maximize net data load. > A good default value is 90% or 0.9. > MAX_IDEAL_REQUEST_SIZE: > This means what is the maximum single request size (in bytes) to maximize > net data load. > A good default value is 64 MiB. > The amount of data that needs to be transferred in a single S3 get_object > request to achieve effective bandwidth eff_BW = BW_util * BW is as follows: > eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW) > Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the > following result: > ideal_request_size = max_io_gap * BW_util / (1 - BW_util) > Applying the MAX_IDEAL_REQUEST_SIZE, we get the following: > ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - > BW_util)) > The proposal is to create a named constructor in the io::CacheOptions (PR: > [https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) to > compute max_io_gap and ideal_request_size from TTFB and BW which will then be > passed to reader to configure the I/O coalescing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
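[Editor's note: the arithmetic in the description above can be checked numerically. A short sketch with the suggested defaults; variable names follow the description, and this is an illustration, not the io::CacheOptions API.]

```python
def coalescing_params(ttfb_s: float, bandwidth_bps: float,
                      bw_util: float = 0.9,
                      max_ideal_request_size: int = 64 * 1024 * 1024):
    """Derive I/O coalescing parameters from storage metrics.

    max_io_gap:         bandwidth-delay product, TTFB * BW
    ideal_request_size: max_io_gap * BW_util / (1 - BW_util),
                        capped at MAX_IDEAL_REQUEST_SIZE
    """
    max_io_gap = ttfb_s * bandwidth_bps
    ideal_request_size = min(max_ideal_request_size,
                             max_io_gap * bw_util / (1 - bw_util))
    return int(max_io_gap), int(ideal_request_size)

# e.g. TTFB = 50 ms, BW = 100 MB/s: gap = 5 MB, request = 45 MB
```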
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090499#comment-17090499 ] Antoine Pitrou commented on ARROW-8565: --- The solution is to build and install the AWS SDK separately, or to use pre-built binaries. Hopefully CMake will be able to pick them up, if they are installed in the right place. > [C++] Static build with AWS SDK > --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090493#comment-17090493 ] Remi Dettai commented on ARROW-8565: Thanks! With my mediocre understanding of CMake, I'm definitely not the man for the job... ;) As a workaround, is it possible to build Arrow statically and only keep a shared dependency on the SDK?
[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090463#comment-17090463 ] Antoine Pitrou commented on ARROW-8565: --- As far as I remember, this means that the AWS SDK build procedure will by default compile its own version of OpenSSL. I would say it's probably fixable, but you will have to find out how and submit a PR for it :-)
[jira] [Created] (ARROW-8565) [C++] Static build with AWS SDK
Remi Dettai created ARROW-8565: -- Summary: [C++] Static build with AWS SDK Key: ARROW-8565 URL: https://issues.apache.org/jira/browse/ARROW-8565 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.17.0 Reporter: Remi Dettai I can't find my way around the build system when using the S3 client. It seems that only the shared target is allowed when the S3 feature is ON. In the third-party toolchain, it prints: ??FATAL_ERROR "FIXME: Building AWS C++ SDK from source will link with wrong libcrypto"?? What is actually meant is that a static build will not work, correct? If that is the case, should libarrow.a be generated at all when the S3 feature is on? What can be done to fix this? What does it mean that the SDK links to the wrong libcrypto? Is it fixable? Or is there a way to have a static build but maintain a dynamic link to a shared version of the SDK?
[jira] [Created] (ARROW-8564) [Website] Add Ubuntu 20.04 LTS to supported package list
Kouhei Sutou created ARROW-8564: --- Summary: [Website] Add Ubuntu 20.04 LTS to supported package list Key: ARROW-8564 URL: https://issues.apache.org/jira/browse/ARROW-8564 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Commented] (ARROW-8545) [Python] Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090391#comment-17090391 ] Jacek Pliszka commented on ARROW-8545: -- Int to decimal is not implemented either, though it is much simpler than float to decimal, as no rounding handling is needed (we have no negative scale at the moment). If no one steps up before May, I may have some time then to do it. It is similar to what we did with decimal-to-decimal and decimal-to-int casting: https://issues.apache.org/jira/browse/ARROW-3329 > [Python] Allow fast writing of Decimal column to parquet > > > Key: ARROW-8545 > URL: https://issues.apache.org/jira/browse/ARROW-8545 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.0 >Reporter: Fons de Leeuw >Priority: Minor > > Currently, when one wants to use a decimal datatype in pandas, the only > possibility is to use the `decimal.Decimal` standard-library type. This is > then an "object" column in the DataFrame. > Arrow can write a column of decimal type to Parquet, which is quite > impressive given that fastparquet does not write decimals at > all. However, the writing is *very* slow, by a factor of 4 in the code snippet > below. > *Improvements* > Of course the best outcome would be if the conversion of a decimal column could > be made faster, but I am not familiar enough with pandas internals to know if > that's possible. (This same behavior also applies to `.to_pickle` etc.) > It would be nice if a warning were shown that object-typed columns are being > converted, which is very slow. That would at least make this behavior more > explicit. > Now, if fast parsing of a decimal.Decimal object column is not possible, it > would be nice if a workaround were possible. For example, pass an int and then > shift the dot "x" places to the left. > (It is already possible to pass an int > column and specify a "decimal" dtype in the Arrow schema during > `pa.Table.from_pandas()`, but then it simply becomes a decimal without > decimals.) Also, it might be nice if it could be encoded as a 128-bit byte > string in the pandas column and then directly interpreted by Arrow. > *Usecase* > I need to save large dataframes (~10GB) of geospatial data with > latitude/longitude. I can't use float as comparisons need to be exact, and > the BigQuery "clustering" feature needs either an integer or a decimal but > not a float. In the meantime, I have to use a workaround where I use only ints > (the original number multiplied by 1000). > *Snippet* > {code:python} > import decimal > from time import time > import numpy as np > import pandas as pd > d = dict() > for col in "abcdefghijklmnopqrstuvwxyz": >     d[col] = np.random.rand(int(1E7)) * 100 > df = pd.DataFrame(d) > t0 = time() > df.to_parquet("/tmp/testabc.pq", engine="pyarrow") > t1 = time() > df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal) > t2 = time() > df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow") > t3 = time() > print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal > column {t3-t2:.3f}s") > # Saving the normal dataframe took 4.430s, with one decimal column > 17.673s {code}
[jira] [Commented] (ARROW-8545) [Python] Allow fast writing of Decimal column to parquet
[ https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090385#comment-17090385 ] Joris Van den Bossche commented on ARROW-8545: -- As [~jacek.pliszka] says, there are indeed two different parts to this: 1) converting a pandas DataFrame with decimal objects to an Arrow Table, and 2) writing the Arrow Table to Parquet. From a quick timing, the slowdown you see with decimals is almost entirely due to step 1 (so not related to writing Parquet itself). Using the same dataframe creation as your code above (only using 10x less data to easily fit on my laptop): {code:python} ... df1 = pd.DataFrame(d) # second dataframe with the decimal column df2 = df1.copy() df2["a"] = df2["a"].round(decimals=3).astype(str).map(decimal.Decimal) # convert each of them to a pyarrow.Table table1 = pa.table(df1) table2 = pa.table(df2) {code} Timing the conversion to pyarrow.Table: {code} In [13]: %timeit pa.table(df1) 32 ms ± 7.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) In [14]: %timeit pa.table(df2) 1.54 s ± 221 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) {code} and then timing the writing of the pyarrow.Table to Parquet: {code} In [16]: import pyarrow.parquet as pq In [17]: %timeit pq.write_table(table1, "/tmp/testabc.parquet") 710 ms ± 29.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [18]: %timeit pq.write_table(table2, "/tmp/testabc.parquet") 750 ms ± 44.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) {code} and timing {{to_parquet()}} more or less gives the sum of the two steps above: {code} In [20]: %timeit df1.to_parquet("/tmp/testabc.pq", engine="pyarrow") 793 ms ± 73.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [21]: %timeit df2.to_parquet("/tmp/testabc.pq", engine="pyarrow") 2.01 s ± 61.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) {code} So you can see here that the actual writing to Parquet is only slightly slower, and that the large slowdown comes from converting the Python Decimal objects to a pyarrow decimal column. Of course, when starting from a pandas DataFrame to write to Parquet, this conversion of pandas to pyarrow.Table is part of the overall process. But to improve this, I think the only solution is to _not_ use Python {{decimal.Decimal}} objects in an object-dtype column. Some options for this: * Do the casting to decimal on the pyarrow side. However, as [~jacek.pliszka] linked, this is not yet implemented for floats (ARROW-8557). I am not directly sure if other conversions are possible right now in pyarrow (like converting as ints and converting those to decimal with a factor). * Use a pandas ExtensionDtype to store decimals in a pandas DataFrame differently (not as Python objects). I am not aware of an existing project that already does this (except for Fletcher, which experiments with storing arrow types in pandas dataframes in general). It might be that this Python Decimal object -> pyarrow decimal array conversion is not fully optimized; however, since it involves dealing with a numpy array of Python objects, it will never be as fast as converting a numpy float array to pyarrow.
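The scaled-integer workaround mentioned in the issue (storing the original numbers multiplied by 1000) can be sketched as follows. The column names, scale factor, and file path here are illustrative, not from the original report:

```python
import numpy as np
import pandas as pd

SCALE = 1000  # three decimal places, as in the issue's lat/lon workaround

df = pd.DataFrame({"lat": [52.3702, 48.8566]})
# Store exact scaled integers instead of Decimal objects: the column stays a
# native int64 dtype, so conversion to Arrow and writing to Parquet stay fast,
# and comparisons on the stored values remain exact.
df["lat_scaled"] = (df["lat"] * SCALE).round().astype("int64")
# df[["lat_scaled"]].to_parquet("coords.parquet", engine="pyarrow")
# Divide by SCALE when reading back to recover the decimal values.
```

This trades schema clarity (the consumer must know the scale) for a fast native-dtype path end-to-end.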
[jira] [Commented] (ARROW-8455) [Rust] [Parquet] Arrow column read on partially compatible files
[ https://issues.apache.org/jira/browse/ARROW-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090358#comment-17090358 ] Remi Dettai commented on ARROW-8455: [~csun] can you take a look at this? > [Rust] [Parquet] Arrow column read on partially compatible files > > > Key: ARROW-8455 > URL: https://issues.apache.org/jira/browse/ARROW-8455 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.16.0 >Reporter: Remi Dettai >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > Seen behavior: when reading a Parquet file into Arrow with > `get_record_reader_by_columns`, it fails if one of the columns of the file > is a list (or has any other unsupported type). > Expected behavior: it should only fail if you are actually reading a column > with an unsupported type.