[jira] [Updated] (ARROW-8552) [Rust] support column iteration for parquet row

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8552:
--
Labels: pull-request-available  (was: )

> [Rust] support column iteration for parquet row
> ---
>
> Key: ARROW-8552
> URL: https://issues.apache.org/jira/browse/ARROW-8552
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: QP Hou
>Priority: Minor
>  Labels: pull-request-available
>
> It would be useful to be able to iterate through all the columns in a parquet 
> row.
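A hedged Python model of the requested behavior (the eventual Rust API shape, such as a `get_column_iter`-style method, is an assumption here, not confirmed by this thread): iterating a materialized row yields (column name, value) pairs.

```python
# Toy model of a materialized parquet row: a mapping from column name to value.
row = {"id": 1, "name": "arrow", "score": 0.5}

def column_iter(row):
    """Yield (column name, value) pairs - the kind of API this issue asks for."""
    for name, value in row.items():
        yield name, value

columns = list(column_iter(row))
```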



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8552) [Rust] support column iteration for parquet row

2020-04-21 Thread QP Hou (Jira)
QP Hou created ARROW-8552:
-

 Summary: [Rust] support column iteration for parquet row
 Key: ARROW-8552
 URL: https://issues.apache.org/jira/browse/ARROW-8552
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: QP Hou


It would be useful to be able to iterate through all the columns in a parquet 
row.





[jira] [Created] (ARROW-8551) [CI][Gandiva] Use docker image with LLVM 8 to build gandiva linux jar

2020-04-21 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-8551:
---

 Summary: [CI][Gandiva] Use docker image with LLVM 8 to build 
gandiva linux jar
 Key: ARROW-8551
 URL: https://issues.apache.org/jira/browse/ARROW-8551
 Project: Apache Arrow
  Issue Type: Task
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla








[jira] [Updated] (ARROW-8551) [CI][Gandiva] Use LLVM 8 to build gandiva linux jar

2020-04-21 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-8551:

Summary: [CI][Gandiva] Use LLVM 8 to build gandiva linux jar  (was: 
[CI][Gandiva] Use docker image with LLVM 8 to build gandiva linux jar)

> [CI][Gandiva] Use LLVM 8 to build gandiva linux jar
> ---
>
> Key: ARROW-8551
> URL: https://issues.apache.org/jira/browse/ARROW-8551
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>






[jira] [Updated] (ARROW-8551) [CI][Gandiva] Use LLVM 8 to build gandiva linux jar

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8551:
--
Labels: pull-request-available  (was: )

> [CI][Gandiva] Use LLVM 8 to build gandiva linux jar
> ---
>
> Key: ARROW-8551
> URL: https://issues.apache.org/jira/browse/ARROW-8551
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Closed] (ARROW-8528) [CI][NIGHTLY:gandiva-jar-osx] gandiva osx build is failing

2020-04-21 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla closed ARROW-8528.
---
Resolution: Cannot Reproduce

> [CI][NIGHTLY:gandiva-jar-osx] gandiva osx build is failing
> --
>
> Key: ARROW-8528
> URL: https://issues.apache.org/jira/browse/ARROW-8528
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (ARROW-8528) [CI][NIGHTLY:gandiva-jar-osx] gandiva osx build is failing

2020-04-21 Thread Prudhvi Porandla (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089246#comment-17089246
 ] 

Prudhvi Porandla commented on ARROW-8528:
-

resolved with [https://github.com/Homebrew/homebrew-core/pull/53445/files]

> [CI][NIGHTLY:gandiva-jar-osx] gandiva osx build is failing
> --
>
> Key: ARROW-8528
> URL: https://issues.apache.org/jira/browse/ARROW-8528
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (ARROW-8531) [C++] Deprecate ARROW_USE_SIMD CMake option

2020-04-21 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089225#comment-17089225
 ] 

Yibo Cai commented on ARROW-8531:
-

ARROW_USE_SIMD was removed in https://github.com/apache/arrow/pull/6954

> [C++] Deprecate ARROW_USE_SIMD CMake option
> ---
>
> Key: ARROW-8531
> URL: https://issues.apache.org/jira/browse/ARROW-8531
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This is superseded by the {{ARROW_SIMD_LEVEL}} option
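For reference, a hedged sketch of how the replacement option is passed to CMake (the option name comes from the comment above; the specific level value shown is illustrative, not confirmed by this thread):

```shell
# Configure Arrow C++ with an explicit SIMD level instead of the removed
# ARROW_USE_SIMD toggle (level value is illustrative).
cmake -DARROW_SIMD_LEVEL=AVX2 ..
```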





[jira] [Resolved] (ARROW-8550) [CI] Don't run cron GHA jobs on forks

2020-04-21 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-8550.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7005
[https://github.com/apache/arrow/pull/7005]

> [CI] Don't run cron GHA jobs on forks
> -
>
> Key: ARROW-8550
> URL: https://issues.apache.org/jira/browse/ARROW-8550
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> It's wasteful, and I'm tired of seeing them clogging up my Actions tab and 
> notifications. 
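The conventional guard for this (a hedged sketch; whether pull request 7005 used exactly this condition is not confirmed here) is a repository check on the scheduled job, so it only runs in the upstream repository:

```yaml
# Run the cron job only in the upstream repository, not on forks.
on:
  schedule:
    - cron: "0 0 * * *"

jobs:
  nightly:
    if: github.repository == 'apache/arrow'
    runs-on: ubuntu-latest
    steps:
      - run: echo "nightly job"
```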





[jira] [Commented] (ARROW-8508) [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets

2020-04-21 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089198#comment-17089198
 ] 

Mark Hildreth commented on ARROW-8508:
--

I believe there are a few things going on here:

1.) I wouldn't consider myself an expert on these APIs, but it seems like the 
builders are being used correctly.

2.) The debug output definitely appears broken. I opened a [PR to fix 
this|https://github.com/apache/arrow/pull/7006], which puts it more in line 
with how the non-fixed size *ListArray* works. This should fix the *value()* 
method on the FixedSizeListArray to properly take the offset into the child 
array into account.

3.) As for the asserts that fail, I'm less certain. The values from 
these asserts are taken from the *values()* method, which seems to just return 
the underlying array without taking offsets into account. This seems to be 
similar to how other arrays work (including primitives), so my guess is that it 
is by design. I don't have a suggestion for a better way of using the API, so 
maybe someone else can provide input.

> [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets
> 
>
> Key: ARROW-8508
> URL: https://issues.apache.org/jira/browse/ARROW-8508
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Christian Beilschmidt
>Priority: Major
>  Labels: pull-request-available
>
> I created an example of storing multi points with Arrow.
>  # A coordinate consists of two floats (Float64Builder)
>  # A multi point consists of one or more coordinates (FixedSizeListBuilder)
>  # A list of multi points consists of multiple multi points (ListBuilder)
> This is the corresponding code snippet:
> {code:java}
> let float_builder = arrow::array::Float64Builder::new(0);
> let coordinate_builder = 
> arrow::array::FixedSizeListBuilder::new(float_builder, 2);
> let mut multi_point_builder = 
> arrow::array::ListBuilder::new(coordinate_builder);
> multi_point_builder
> .values()
> .values()
> .append_slice(&[0.0, 0.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder
> .values()
> .values()
> .append_slice(&[1.0, 1.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder.append(true).unwrap(); // first multi point
> multi_point_builder
> .values()
> .values()
> .append_slice(&[2.0, 2.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder
> .values()
> .values()
> .append_slice(&[3.0, 3.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder
> .values()
> .values()
> .append_slice(&[4.0, 4.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder.append(true).unwrap(); // second multi point
> let multi_point = dbg!(multi_point_builder.finish());
> let first_multi_point_ref = multi_point.value(0);
> let first_multi_point: &arrow::array::FixedSizeListArray = 
> first_multi_point_ref.as_any().downcast_ref().unwrap();
> let coordinates_ref = first_multi_point.values();
> let coordinates: &arrow::array::Float64Array = 
> coordinates_ref.as_any().downcast_ref().unwrap();
> assert_eq!(coordinates.value_slice(0, 2 * 2), &[0.0, 0.1, 1.0, 1.1]);
> let second_multi_point_ref = multi_point.value(1);
> let second_multi_point: &arrow::array::FixedSizeListArray = 
> second_multi_point_ref.as_any().downcast_ref().unwrap();
> let coordinates_ref = second_multi_point.values();
> let coordinates: &arrow::array::Float64Array = 
> coordinates_ref.as_any().downcast_ref().unwrap();
> assert_eq!(coordinates.value_slice(0, 2 * 3), &[2.0, 2.1, 3.0, 3.1, 4.0, 
> 4.1]);
> {code}
> The second assertion fails and the output is {{[0.0, 0.1, 1.0, 1.1, 2.0, 
> 2.1]}}.
> Moreover, the debug output produced from {{dbg!}} confirms this:
> {noformat}
> [
>   FixedSizeListArray<2>
> [
>   PrimitiveArray
> [
>   0.0,
>   0.1,
> ],
>   PrimitiveArray
> [
>   1.0,
>   1.1,
> ],
> ],
>   FixedSizeListArray<2>
> [
>   PrimitiveArray
> [
>   0.0,
>   0.1,
> ],
>   PrimitiveArray
> [
>   1.0,
>   1.1,
> ],
>   PrimitiveArray
> [
>   2.0,
>   2.1,
> ],
> ],
> ]{noformat}
> The second list should contain the values 2-4.
>  
> So either I am using the builder wrong or there is a bug with the offsets. I 
> used {{0.16}} as well as the current {{master}} from GitHub.
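The reported behavior can be reproduced with a small Python model of the buffer layout (a sketch under assumed semantics, not the actual Rust internals): the outer ListArray stores offsets into the child's fixed-size slots, and slicing a list value must apply that offset to the child buffer.

```python
# Flat child buffer produced by the builder: five coordinate pairs.
values = [0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 3.0, 3.1, 4.0, 4.1]
width = 2               # FixedSizeListBuilder width
offsets = [0, 2, 5]     # outer list offsets, in fixed-size slots

def list_value(i):
    # Correct: slice the child starting at the outer offset.
    start, end = offsets[i], offsets[i + 1]
    return values[start * width : end * width]

def buggy_value(i):
    # Ignoring the outer offset always slices from the start of the child,
    # which reproduces the wrong output reported above.
    n = offsets[i + 1] - offsets[i]
    return values[0 : n * width]
```

With these definitions, `buggy_value(1)` yields `[0.0, 0.1, 1.0, 1.1, 2.0, 2.1]`, the exact wrong slice from the failing assertion, while `list_value(1)` yields the expected values 2-4.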





[jira] [Updated] (ARROW-8537) [C++] Performance regression from ARROW-8523

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8537:
--
Labels: pull-request-available  (was: )

> [C++] Performance regression from ARROW-8523
> 
>
> Key: ARROW-8537
> URL: https://issues.apache.org/jira/browse/ARROW-8537
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
>
> I optimized BitmapReader in [this 
> PR|https://github.com/apache/arrow/pull/6986] and saw a performance uplift in 
> the BitmapReader test case. I didn't check the other test cases, as the change 
> looked trivial.
> I reviewed all test cases just now and see a big performance drop in 4 cases; 
> details at the [PR 
> link|https://github.com/apache/arrow/pull/6986#issuecomment-616915079].
> I also compared the performance of code using BitmapReader and found no 
> obvious changes. Looks like we should revert that PR.





[jira] [Updated] (ARROW-8508) [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8508:
--
Labels: pull-request-available  (was: )

> [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets
> 
>
> Key: ARROW-8508
> URL: https://issues.apache.org/jira/browse/ARROW-8508
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Christian Beilschmidt
>Priority: Major
>  Labels: pull-request-available
>





[jira] [Commented] (ARROW-8550) [CI] Don't run cron GHA jobs on forks

2020-04-21 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089125#comment-17089125
 ] 

Antoine Pitrou commented on ARROW-8550:
---

I don't really agree; I use them from time to time.

> [CI] Don't run cron GHA jobs on forks
> -
>
> Key: ARROW-8550
> URL: https://issues.apache.org/jira/browse/ARROW-8550
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>
> It's wasteful, and I'm tired of seeing them clogging up my Actions tab and 
> notifications. 





[jira] [Assigned] (ARROW-8512) [C++] Delete unused compute expr prototype code

2020-04-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8512:
---

Assignee: Wes McKinney

> [C++] Delete unused compute expr prototype code
> ---
>
> Key: ARROW-8512
> URL: https://issues.apache.org/jira/browse/ARROW-8512
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Most of the code added in 
> https://github.com/apache/arrow/commit/08ca13f83f3d6dbd818c4280d619dae306aa9de5
>  can be deleted. I may leave some of the "shape" types in case we can make 
> use of those. 





[jira] [Created] (ARROW-8550) [CI] Don't run cron GHA jobs on forks

2020-04-21 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8550:
--

 Summary: [CI] Don't run cron GHA jobs on forks
 Key: ARROW-8550
 URL: https://issues.apache.org/jira/browse/ARROW-8550
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Neal Richardson


It's wasteful, and I'm tired of seeing them clogging up my Actions tab and 
notifications. 





[jira] [Updated] (ARROW-8550) [CI] Don't run cron GHA jobs on forks

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8550:
--
Labels: pull-request-available  (was: )

> [CI] Don't run cron GHA jobs on forks
> -
>
> Key: ARROW-8550
> URL: https://issues.apache.org/jira/browse/ARROW-8550
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>
> It's wasteful, and I'm tired of seeing them clogging up my Actions tab and 
> notifications. 





[jira] [Updated] (ARROW-8549) [R] Assorted post-0.17 release cleanups

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8549:
--
Labels: pull-request-available  (was: )

> [R] Assorted post-0.17 release cleanups
> ---
>
> Key: ARROW-8549
> URL: https://issues.apache.org/jira/browse/ARROW-8549
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Created] (ARROW-8549) [R] Assorted post-0.17 release cleanups

2020-04-21 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8549:
--

 Summary: [R] Assorted post-0.17 release cleanups
 Key: ARROW-8549
 URL: https://issues.apache.org/jira/browse/ARROW-8549
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0








[jira] [Commented] (ARROW-7011) [C++] Implement casts from float/double to decimal128

2020-04-21 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089080#comment-17089080
 ] 

Jacek Pliszka commented on ARROW-7011:
--

Would it be enough to multiply by 10**scale, cast to integer, cast to 
decimal128, and correct the scale?
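The proposed cast can be sketched in pure Python (an illustration of the arithmetic only; the function name and rounding mode are assumptions, and a real C++ kernel would also need overflow and precision checks):

```python
from decimal import Decimal

def float_to_decimal_sketch(x: float, scale: int) -> Decimal:
    # Multiply by 10**scale and round to the nearest integer
    # (the "cast to integer" step), ...
    unscaled = round(x * 10 ** scale)
    # ... then reinterpret the integer with the target scale
    # (the "correct the scale" step).
    return Decimal(unscaled).scaleb(-scale)
```

A production cast would additionally have to reject values whose digits exceed the target precision.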

> [C++] Implement casts from float/double to decimal128
> -
>
> Key: ARROW-7011
> URL: https://issues.apache.org/jira/browse/ARROW-7011
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> see also ARROW-5905, ARROW-7010





[jira] [Created] (ARROW-8548) [Website] 0.17 release post

2020-04-21 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8548:
--

 Summary: [Website] 0.17 release post
 Key: ARROW-8548
 URL: https://issues.apache.org/jira/browse/ARROW-8548
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Neal Richardson
Assignee: Neal Richardson








[jira] [Commented] (ARROW-8545) [Python] Allow fast writing of Decimal column to parquet

2020-04-21 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089069#comment-17089069
 ] 

Jacek Pliszka commented on ARROW-8545:
--

A cast from float should allow quick conversion.

> [Python] Allow fast writing of Decimal column to parquet
> 
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.0
>Reporter: Fons de Leeuw
>Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only 
> possibility is to use the `decimal.Decimal` standard-library type. This is 
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite 
> impressive given that [fastparquet does not write decimals|#data-types] at 
> all. However, the writing is *very* slow, in the code snippet below by a 
> factor of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can 
> be made faster, but I am not familiar enough with pandas internals to know if 
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice, if a warning is shown that object-typed columns are being 
> converted which is very slow. That would at least make this behavior more 
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it 
> would be nice if a workaround is possible. For example, pass an int and then 
> shift the dot "x" places to the left. (It is already possible to pass an int 
> column and specify "decimal" dtype in the Arrow schema during 
> `pa.Table.from_pandas()` but then it simply becomes a decimal without 
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte 
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with 
> latitude/longitude. I can't use float as comparisons need to be exact, and 
> the BigQuery "clustering" feature needs either an integer or a decimal but 
> not a float. In the meantime, I have to do a workaround where I use only ints 
> (the original number multiplied by 1000.)
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
> d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal 
> column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 
> 17.673s{code}
>  
>  
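The scaled-integer workaround described in the Usecase section above can be sketched as follows (the names and the scale of 3 are illustrative, mirroring the "multiplied by 1000" note in the issue):

```python
SCALE = 3  # store three decimal places, i.e. multiply by 1000

def encode(coord: float) -> int:
    # Round, rather than truncate, to absorb float representation error.
    return round(coord * 10 ** SCALE)

def decode(i: int) -> float:
    return i / 10 ** SCALE
```

Comparisons on the encoded ints are exact, which is the property the float columns could not provide.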





[jira] [Resolved] (ARROW-8542) [Release] Fix checksum url in the website post release script

2020-04-21 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-8542.
-
Resolution: Fixed

Issue resolved by pull request 6999
[https://github.com/apache/arrow/pull/6999]

> [Release] Fix checksum url in the website post release script
> -
>
> Key: ARROW-8542
> URL: https://issues.apache.org/jira/browse/ARROW-8542
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The issue was captured here 
> https://github.com/apache/arrow-site/pull/53#discussion_r411728907





[jira] [Commented] (ARROW-8541) [Release] Don't remove previous source releases automatically

2020-04-21 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089058#comment-17089058
 ] 

Kouhei Sutou commented on ARROW-8541:
-

Wow! I didn't know the archive site.

> [Release] Don't remove previous source releases automatically
> -
>
> Key: ARROW-8541
> URL: https://issues.apache.org/jira/browse/ARROW-8541
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We should keep at least the last three source tarballs.





[jira] [Updated] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails

2020-04-21 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-8539:

Fix Version/s: 1.0.0

> [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
> --
>
> Key: ARROW-8539
> URL: https://issues.apache.org/jira/browse/ARROW-8539
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C, Continuous Integration, GLib
>Reporter: Antoine Pitrou
>Assignee: Yosuke Shiro
>Priority: Critical
> Fix For: 1.0.0
>
>
> See e.g.
> https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868
> {code}
> [192/256] Generating arithmetic_ops.bc
> FAILED: src/gandiva/precompiled/arithmetic_ops.bc 
> cd 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled
>  && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env 
> SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
>  /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG 
> -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c 
> /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc
>  -o 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc
>  -isysroot 
> /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
>  -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src
> dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib
>   Referenced from: /usr/local/opt/llvm@8/bin/clang-8
>   Reason: image not found
> Child aborted
> {code}





[jira] [Updated] (ARROW-8545) [Python] Allow fast writing of Decimal column to parquet

2020-04-21 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8545:
---
Summary: [Python] Allow fast writing of Decimal column to parquet  (was: 
Allow fast writing of Decimal column to parquet)

> [Python] Allow fast writing of Decimal column to parquet
> 
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.0
>Reporter: Fons de Leeuw
>Priority: Minor
>





[jira] [Created] (ARROW-8547) [Rust] Implement JsonEqual for UnionArray

2020-04-21 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-8547:
--

 Summary: [Rust] Implement JsonEqual for UnionArray
 Key: ARROW-8547
 URL: https://issues.apache.org/jira/browse/ARROW-8547
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan








[jira] [Created] (ARROW-8546) [Rust] Handle UnionArray in get_fb_field_type

2020-04-21 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-8546:
--

 Summary: [Rust] Handle UnionArray in get_fb_field_type
 Key: ARROW-8546
 URL: https://issues.apache.org/jira/browse/ARROW-8546
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan








[jira] [Comment Edited] (ARROW-8545) Allow fast writing of Decimal column to parquet

2020-04-21 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089011#comment-17089011
 ] 

Jacek Pliszka edited comment on ARROW-8545 at 4/21/20, 7:54 PM:


OK, I checked, and this is my version:

 
{code:java}
pat = pa.Table.from_pandas(df)
t3 = time()
print(t3-t2)
pat = pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3)))
t4 = time()
print(t4 - t3)
pq.write_table(pat, '/tmp/testabd.pq')
t5 = time()
print(t5 - t4)
{code}
And here is what we get:

A) 0.3s for conversion from pandas to arrow Table

B) cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast 
implemented from double to decimal(38, 3)

C) 2.8s for writing the table to a parquet file - is it fast enough for you?

 

B and C are separate topics and should have separate issues. In B, decimal128 
should be easier if this is enough for you.
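
The failed cast in (B) touches on float inexactness. As a hedged, stdlib-only 
sketch (no pyarrow required), this is why producing an exact decimal(38, 3) 
value usually routes through rounding and a string, as the issue's snippet does:

{code:python}
# Minimal sketch, assuming scale 3: a binary float does not hold 12.345
# exactly, so parsing the rounded string gives the exact decimal value,
# while converting the raw float directly carries extra digits.
from decimal import Decimal

x = 100 * 0.12345                  # float arithmetic, slightly inexact
exact = Decimal(str(round(x, 3)))  # round to scale 3, then parse exactly

assert Decimal(x) != Decimal("12.345")  # raw float carries extra digits
assert exact == Decimal("12.345")
{code}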

 

 


was (Author: jacek.pliszka):
OK, I checked, and this is my version:

 
{code:java}
pat = pa.Table.from_pandas(df)
t3 = time()
print(t3-t2)
pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3)))
t4 = time()
print(t4 - t3)
pq.write_table(pat, '/tmp/testabd.pq')
t5 = time()
print(t5 - t4)
{code}
And here is what we get:

A) 0.3s for conversion from pandas to arrow Table

B) cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast 
implemented from double to decimal(38, 3)

C) 2.8s for writing the table to a parquet file - is it fast enough for you?

 

B and C are separate topics and should have separate issues. In B, decimal128 
should be easier if this is enough for you.

 

 

> Allow fast writing of Decimal column to parquet
> ---
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.0
>Reporter: Fons de Leeuw
>Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only 
> possibility is to use the `decimal.Decimal` standard-library type. This is 
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite 
> impressive given that [fastparquet does not write 
> decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] 
> at all. However, the writing is *very* slow; in the code snippet below, by a 
> factor of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can 
> be made faster, but I am not familiar enough with pandas internals to know if 
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice if a warning were shown that object-typed columns are being 
> converted, which is very slow. That would at least make this behavior more 
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it 
> would be nice if a workaround is possible. For example, pass an int and then 
> shift the dot "x" places to the left. (It is already possible to pass an int 
> column and specify "decimal" dtype in the Arrow schema during 
> `pa.Table.from_pandas()` but then it simply becomes a decimal without 
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte 
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with 
> latitude/longitude. I can't use float as comparisons need to be exact, and 
> the BigQuery "clustering" feature needs either an integer or a decimal but 
> not a float. In the meantime, I have to do a workaround where I use only ints 
> (the original number multiplied by 1000.)
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 17.673s{code}
>  
>  





[jira] [Comment Edited] (ARROW-8545) Allow fast writing of Decimal column to parquet

2020-04-21 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089011#comment-17089011
 ] 

Jacek Pliszka edited comment on ARROW-8545 at 4/21/20, 7:53 PM:


OK, I checked, and this is my version:

 
{code:java}
pat = pa.Table.from_pandas(df)
t3 = time()
print(t3-t2)
pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3)))
t4 = time()
print(t4 - t3)
pq.write_table(pat, '/tmp/testabd.pq')
t5 = time()
print(t5 - t4)
{code}
And here is what we get:

A) 0.3s for conversion from pandas to arrow Table

B) cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast 
implemented from double to decimal(38, 3)

C) 2.8s for writing the table to a parquet file - is it fast enough for you?

 

B and C are separate topics and should have separate issues. In B, decimal128 
should be easier if this is enough for you.

 

 


was (Author: jacek.pliszka):
OK, I checked, and this is my version:

 
{code:java}
pat = pa.Table.from_pandas(df)
t3 = time()
print(t3-t2)
pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3)))
t4 = time()
print(t4 - t3)
pq.write_table(pat, '/tmp/testabd.pq')
t5 = time()
print(t5 - t4)
{code}
And here is what we get:

A) 0.3s for conversion from pandas to arrow Table

B) cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast 
implemented from double to decimal(38, 3)

C) 2.8s for writing the table to a parquet file - is it fast enough for you?

 

B and C are separate topics and should have separate issues. In B, decimal128 
should be easier if this is enough for you.

 

 

> Allow fast writing of Decimal column to parquet
> ---
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.0
>Reporter: Fons de Leeuw
>Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only 
> possibility is to use the `decimal.Decimal` standard-library type. This is 
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite 
> impressive given that [fastparquet does not write 
> decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] 
> at all. However, the writing is *very* slow; in the code snippet below, by a 
> factor of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can 
> be made faster, but I am not familiar enough with pandas internals to know if 
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice if a warning were shown that object-typed columns are being 
> converted, which is very slow. That would at least make this behavior more 
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it 
> would be nice if a workaround is possible. For example, pass an int and then 
> shift the dot "x" places to the left. (It is already possible to pass an int 
> column and specify "decimal" dtype in the Arrow schema during 
> `pa.Table.from_pandas()` but then it simply becomes a decimal without 
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte 
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with 
> latitude/longitude. I can't use float as comparisons need to be exact, and 
> the BigQuery "clustering" feature needs either an integer or a decimal but 
> not a float. In the meantime, I have to do a workaround where I use only ints 
> (the original number multiplied by 1000.)
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 17.673s{code}
>  
>  





[jira] [Commented] (ARROW-8545) Allow fast writing of Decimal column to parquet

2020-04-21 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089011#comment-17089011
 ] 

Jacek Pliszka commented on ARROW-8545:
--

OK, I checked, and this is my version:

 
{code:java}
pat = pa.Table.from_pandas(df)
t3 = time()
print(t3-t2)
pat.set_column(0, 'a', pat.column(0).cast(pa.decimal128(38, 3)))
t4 = time()
print(t4 - t3)
pq.write_table(pat, '/tmp/testabd.pq')
t5 = time()
print(t5 - t4)
{code}
And here is what we get:

A) 0.3s for conversion from pandas to arrow Table

B) cast to decimal fails: pyarrow.lib.ArrowNotImplementedError: No cast 
implemented from double to decimal(38, 3)

C) 2.8s for writing the table to a parquet file - is it fast enough for you?

 

B and C are separate topics and should have separate issues. In B, decimal128 
should be easier if this is enough for you.

 

 

> Allow fast writing of Decimal column to parquet
> ---
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.0
>Reporter: Fons de Leeuw
>Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only 
> possibility is to use the `decimal.Decimal` standard-library type. This is 
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite 
> impressive given that [fastparquet does not write 
> decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] 
> at all. However, the writing is *very* slow; in the code snippet below, by a 
> factor of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can 
> be made faster, but I am not familiar enough with pandas internals to know if 
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice if a warning were shown that object-typed columns are being 
> converted, which is very slow. That would at least make this behavior more 
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it 
> would be nice if a workaround is possible. For example, pass an int and then 
> shift the dot "x" places to the left. (It is already possible to pass an int 
> column and specify "decimal" dtype in the Arrow schema during 
> `pa.Table.from_pandas()` but then it simply becomes a decimal without 
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte 
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with 
> latitude/longitude. I can't use float as comparisons need to be exact, and 
> the BigQuery "clustering" feature needs either an integer or a decimal but 
> not a float. In the meantime, I have to do a workaround where I use only ints 
> (the original number multiplied by 1000.)
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 17.673s{code}
>  
>  





[jira] [Comment Edited] (ARROW-8545) Allow fast writing of Decimal column to parquet

2020-04-21 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088988#comment-17088988
 ] 

Jacek Pliszka edited comment on ARROW-8545 at 4/21/20, 7:20 PM:


Once you have decimal - is writing fast enough?

Because actually you are talking about 2 different things:
 # cast to arrow decimal - not sure if it is implemented but it is relatively 
easy
 # fast writing decimal to parquet - is it fast enough for you?


was (Author: jacek.pliszka):
My suggestion would be to split into 2 pieces:

 
 # cast to decimal - not sure if it is implemented but it is relatively easy
 # writing decimal to parquet - not sure what the status of that is, either

> Allow fast writing of Decimal column to parquet
> ---
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.0
>Reporter: Fons de Leeuw
>Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only 
> possibility is to use the `decimal.Decimal` standard-library type. This is 
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite 
> impressive given that [fastparquet does not write 
> decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] 
> at all. However, the writing is *very* slow; in the code snippet below, by a 
> factor of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can 
> be made faster, but I am not familiar enough with pandas internals to know if 
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice if a warning were shown that object-typed columns are being 
> converted, which is very slow. That would at least make this behavior more 
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it 
> would be nice if a workaround is possible. For example, pass an int and then 
> shift the dot "x" places to the left. (It is already possible to pass an int 
> column and specify "decimal" dtype in the Arrow schema during 
> `pa.Table.from_pandas()` but then it simply becomes a decimal without 
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte 
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with 
> latitude/longitude. I can't use float as comparisons need to be exact, and 
> the BigQuery "clustering" feature needs either an integer or a decimal but 
> not a float. In the meantime, I have to do a workaround where I use only ints 
> (the original number multiplied by 1000.)
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 17.673s{code}
>  
>  





[jira] [Assigned] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-8543:
-

Assignee: Mayur Srivastava

> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Assignee: Mayur Srivastava
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The current coalescing algorithm is a two pass algorithm (where N is number 
> of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds the coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to a single pass algorithm that 
> computes coalesced ranges while making the first pass over the ranges in the 
> list. This algorithm is also shorter in lines of code and hence (hopefully) 
> more maintainable in the long term.
> Correction: Post sorting, the current algorithm is O(N) and the improvement 
> is O(N). I called the current algo O(N^2) due to an oversight.
>  
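
As a rough illustration only (not the C++ implementation), the single pass idea 
described above can be sketched as follows; the parameter names 
`hole_size_limit` and `range_size_limit` are assumptions, not the library's API:

{code:python}
# Hedged sketch of single-pass range coalescing: walk sorted,
# non-overlapping (offset, length) ranges once, merging neighbors whose
# gap is at most hole_size_limit and emitting a coalesced range whenever
# extending it would exceed range_size_limit.
def coalesce(ranges, hole_size_limit, range_size_limit):
    """ranges: sorted list of (offset, length); returns coalesced (offset, length)."""
    out = []
    if not ranges:
        return out
    start = ranges[0][0]
    end = start + ranges[0][1]
    for off, length in ranges[1:]:
        new_end = off + length
        # Split when the hole is too big or the merged range would be too large.
        if off - end > hole_size_limit or new_end - start > range_size_limit:
            out.append((start, end - start))
            start = off
        end = new_end
    out.append((start, end - start))
    return out
{code}

For example, with a hole limit of 5 bytes, `[(0, 10), (12, 10), (100, 10)]` 
coalesces to `[(0, 22), (100, 10)]` in one pass over the list.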





[jira] [Resolved] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-8543.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7002
[https://github.com/apache/arrow/pull/7002]

> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The current coalescing algorithm is a two pass algorithm (where N is number 
> of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds the coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to a single pass algorithm that 
> computes coalesced ranges while making the first pass over the ranges in the 
> list. This algorithm is also shorter in lines of code and hence (hopefully) 
> more maintainable in the long term.
> Correction: Post sorting, the current algorithm is O(N) and the improvement 
> is O(N). I called the current algo O(N^2) due to an oversight.
>  





[jira] [Commented] (ARROW-8545) Allow fast writing of Decimal column to parquet

2020-04-21 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088988#comment-17088988
 ] 

Jacek Pliszka commented on ARROW-8545:
--

My suggestion would be to split into 2 pieces:

 
 # cast to decimal - not sure if it is implemented but it is relatively easy
 # writing decimal to parquet - not sure what the status of that is, either

> Allow fast writing of Decimal column to parquet
> ---
>
> Key: ARROW-8545
> URL: https://issues.apache.org/jira/browse/ARROW-8545
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.0
>Reporter: Fons de Leeuw
>Priority: Minor
>
> Currently, when one wants to use a decimal datatype in Pandas, the only 
> possibility is to use the `decimal.Decimal` standard-library type. This is 
> then an "object" column in the DataFrame.
> Arrow can write a column of decimal type to Parquet, which is quite 
> impressive given that [fastparquet does not write 
> decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] 
> at all. However, the writing is *very* slow; in the code snippet below, by a 
> factor of 4.
> *Improvements*
> Of course the best outcome would be if the conversion of a decimal column can 
> be made faster, but I am not familiar enough with pandas internals to know if 
> that's possible. (This same behavior also applies to `.to_pickle` etc.)
> It would be nice if a warning were shown that object-typed columns are being 
> converted, which is very slow. That would at least make this behavior more 
> explicit.
> Now, if fast parsing of a decimal.Decimal object column is not possible, it 
> would be nice if a workaround is possible. For example, pass an int and then 
> shift the dot "x" places to the left. (It is already possible to pass an int 
> column and specify "decimal" dtype in the Arrow schema during 
> `pa.Table.from_pandas()` but then it simply becomes a decimal without 
> decimals.) Also, it might be nice if it can be encoded as a 128-bit byte 
> string in the pandas column and then directly interpreted by Arrow.
> *Usecase*
> I need to save large dataframes (~10GB) of geospatial data with 
> latitude/longitude. I can't use float as comparisons need to be exact, and 
> the BigQuery "clustering" feature needs either an integer or a decimal but 
> not a float. In the meantime, I have to do a workaround where I use only ints 
> (the original number multiplied by 1000.)
> *Snippet*
> {code:java}
> import decimal
> from time import time
> import numpy as np
> import pandas as pd
> d = dict()
> for col in "abcdefghijklmnopqrstuvwxyz":
>     d[col] = np.random.rand(int(1E7)) * 100
> df = pd.DataFrame(d)
> t0 = time()
> df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
> t1 = time()
> df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)
> t2 = time()
> df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
> t3 = time()
> print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
> # Saving the normal dataframe took 4.430s, with one decimal column 17.673s{code}
>  
>  





[jira] [Updated] (ARROW-8545) Allow fast writing of Decimal column to parquet

2020-04-21 Thread Fons de Leeuw (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fons de Leeuw updated ARROW-8545:
-
Description: 
Currently, when one wants to use a decimal datatype in Pandas, the only 
possibility is to use the `decimal.Decimal` standard-library type. This is then 
an "object" column in the DataFrame.

Arrow can write a column of decimal type to Parquet, which is quite impressive 
given that [fastparquet does not write 
decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] 
at all. However, the writing is *very* slow; in the code snippet below, by a 
factor of 4.

*Improvements*

Of course the best outcome would be if the conversion of a decimal column can 
be made faster, but I am not familiar enough with pandas internals to know if 
that's possible. (This same behavior also applies to `.to_pickle` etc.)

It would be nice if a warning were shown that object-typed columns are being 
converted, which is very slow. That would at least make this behavior more 
explicit.

Now, if fast parsing of a decimal.Decimal object column is not possible, it 
would be nice if a workaround is possible. For example, pass an int and then 
shift the dot "x" places to the left. (It is already possible to pass an int 
column and specify "decimal" dtype in the Arrow schema during 
`pa.Table.from_pandas()` but then it simply becomes a decimal without 
decimals.) Also, it might be nice if it can be encoded as a 128-bit byte string 
in the pandas column and then directly interpreted by Arrow.

*Usecase*

I need to save large dataframes (~10GB) of geospatial data with 
latitude/longitude. I can't use float as comparisons need to be exact, and the 
BigQuery "clustering" feature needs either an integer or a decimal but not a 
float. In the meantime, I have to do a workaround where I use only ints (the 
original number multiplied by 1000.)
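
The integer workaround described above (shift the dot three places) can be 
sketched with the standard library alone; the helper names here are 
illustrative, not part of any API:

{code:python}
import decimal

SCALE = 3  # three implied decimal places, i.e. multiply by 1000

def to_scaled_int(x: float, scale: int = SCALE) -> int:
    """Encode a float as an integer with `scale` implied decimal places."""
    return int(round(x * 10 ** scale))

def from_scaled_int(n: int, scale: int = SCALE) -> decimal.Decimal:
    """Decode back to an exact Decimal by shifting the dot left again."""
    return decimal.Decimal(n).scaleb(-scale)

lat = 52.379
encoded = to_scaled_int(lat)  # exact integer, so comparisons are exact
assert from_scaled_int(encoded) == decimal.Decimal("52.379")
{code}

The scaled integers compare exactly (unlike floats), which is the property the 
BigQuery clustering use case needs.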

*Snippet*
{code:java}
import decimal
from time import time

import numpy as np
import pandas as pd

d = dict()

for col in "abcdefghijklmnopqrstuvwxyz":
    d[col] = np.random.rand(int(1E7)) * 100

df = pd.DataFrame(d)

t0 = time()

df.to_parquet("/tmp/testabc.pq", engine="pyarrow")

t1 = time()

df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)

t2 = time()

df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")

t3 = time()

print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
# Saving the normal dataframe took 4.430s, with one decimal column 17.673s{code}
 

 

  was:
Currently, when one wants to use a decimal datatype in Pandas, the only 
possibility is to use the `decimal.Decimal` standard-library type. This is then 
an "object" column in the DataFrame.

Arrow can write a column of decimal type to Parquet, which is quite impressive 
given that [fastparquet does not write 
decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] 
at all. However, the writing is *very* slow; in the code snippet below, by a 
factor of 4.

*Improvements*

Of course the best outcome would be if the conversion of a decimal column can 
be made faster, but I am not familiar enough with pandas internals to know if 
that's possible. (This same behavior also applies to `.to_pickle` etc.)

It would be nice if a warning were shown that object-typed columns are being 
converted, which is very slow. That would at least make this behavior more 
explicit.

Now, if fast parsing of a decimal.Decimal object column is not possible, it 
would be nice if a workaround is possible. For example, pass an int and then 
shift the dot "x" places to the left. (It is already possible to pass an int 
column and specify "decimal" dtype in the Arrow schema during 
`pa.Table.from_pandas()` but then it simply becomes a decimal without 
decimals.) Also, it might be nice if it can be encoded as a 128-bit byte string 
in the pandas column and then directly interpreted by Arrow.

*Usecase*

I need to save large dataframes (~10GB) of geospatial data with 
latitude/longitude. I can't use float as comparisons need to be exact, and the 
BigQuery "clustering" feature needs either an integer or a decimal but not a 
float. In the meantime, I have to do a workaround where I use only ints (the 
original number multiplied by 1000.)

*Snippet*

 
{code:java}
import decimal
from time import time

import numpy as np
import pandas as pd

d = dict()

for col in "abcdefghijklmnopqrstuvwxyz":
    d[col] = np.random.rand(int(1E7)) * 100

df = pd.DataFrame(d)

t0 = time()

df.to_parquet("/tmp/testabc.pq", engine="pyarrow")

t1 = time()

df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)

t2 = time()

df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")

t3 = time()

print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")

# Saving the normal dataframe took 4.430s, with one decimal column 17.673s
{code}

 

 


> Allow fast writing of Decimal column to parquet
> 

[jira] [Created] (ARROW-8545) Allow fast writing of Decimal column to parquet

2020-04-21 Thread Fons de Leeuw (Jira)
Fons de Leeuw created ARROW-8545:


 Summary: Allow fast writing of Decimal column to parquet
 Key: ARROW-8545
 URL: https://issues.apache.org/jira/browse/ARROW-8545
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.17.0
Reporter: Fons de Leeuw


Currently, when one wants to use a decimal datatype in Pandas, the only 
possibility is to use the `decimal.Decimal` standard-library type. This is then 
an "object" column in the DataFrame.

Arrow can write a column of decimal type to Parquet, which is quite impressive 
given that [fastparquet does not write 
decimals|https://fastparquet.readthedocs.io/en/latest/details.html#data-types] 
at all. However, the writing is *very* slow; in the code snippet below, by a 
factor of 4.

*Improvements*

Of course the best outcome would be if the conversion of a decimal column can 
be made faster, but I am not familiar enough with pandas internals to know if 
that's possible. (This same behavior also applies to `.to_pickle` etc.)

It would be nice if a warning were shown that object-typed columns are being 
converted, which is very slow. That would at least make this behavior more 
explicit.

Now, if fast parsing of a decimal.Decimal object column is not possible, it 
would be nice if a workaround is possible. For example, pass an int and then 
shift the dot "x" places to the left. (It is already possible to pass an int 
column and specify "decimal" dtype in the Arrow schema during 
`pa.Table.from_pandas()` but then it simply becomes a decimal without 
decimals.) Also, it might be nice if it can be encoded as a 128-bit byte string 
in the pandas column and then directly interpreted by Arrow.

*Usecase*

I need to save large dataframes (~10GB) of geospatial data with 
latitude/longitude. I can't use float as comparisons need to be exact, and the 
BigQuery "clustering" feature needs either an integer or a decimal but not a 
float. In the meantime, I have to do a workaround where I use only ints (the 
original number multiplied by 1000.)

*Snippet*

 
{code:java}
import decimal
from time import time

import numpy as np
import pandas as pd

d = dict()

for col in "abcdefghijklmnopqrstuvwxyz":
    d[col] = np.random.rand(int(1E7)) * 100

df = pd.DataFrame(d)

t0 = time()

df.to_parquet("/tmp/testabc.pq", engine="pyarrow")

t1 = time()

df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)

t2 = time()

df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")

t3 = time()

print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")

# Saving the normal dataframe took 4.430s, with one decimal column 17.673s
{code}

 

 





[jira] [Updated] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Mayur Srivastava (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayur Srivastava updated ARROW-8543:

Description: 
The current coalescing algorithm is a two pass algorithm (where N is number of 
ranges) (first implemented in 
https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
Coalesce() function finds the begin and end of a candidate range that can be 
coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
over the ranges from begin to end and adds the coalesced range to the result (out).

The proposal is to convert the algorithm to a single pass algorithm that 
computes coalesced ranges while making the first pass over the ranges in the 
list. This algorithm is also shorter in lines of code and hence (hopefully) 
more maintainable in the long term.

Correction: Post sorting, the current algorithm is O(N) and the improvement is 
O(N). I called the current algo O(N^2) due to an oversight.

 

  was:
The current coalescing algorithm is an O(N^2) two pass algorithm (where N is 
the number of ranges) (first implemented in 
https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
Coalesce() function finds the begin and end of a candidate range that can be 
coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
over the ranges from begin to end and adds the coalesced range to the result (out).

The proposal is to convert the algorithm to an O(N) single pass algorithm that 
computes coalesced ranges while making the first pass over the ranges in the 
list. This algorithm is also shorter in lines of code and hence (hopefully) 
more maintainable in the long term.

Correction: Post sorting, the current algorithm is O(N) and the improvement is 
O(N).

 


> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>  Labels: pull-request-available
>
> The current coalescing algorithm is a two pass algorithm (where N is number 
> of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to a single pass algorithm that 
> computes coalesced ranges while making the first pass over the ranges in the 
> list. This algorithm is also shorter in lines of code and hence (hopefully) 
> more maintainable in long term.
> Correction: Post sorting, the current algorithm is O(N) and the improvement 
> is O(N). I called the current algo O(N^2) due to an oversight.
>  





[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Mayur Srivastava (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088966#comment-17088966
 ] 

Mayur Srivastava commented on ARROW-8543:
-

Thank you for fixing the PR! This is my first contribution, so I'm not well 
versed in the process.

 

Thanks,

Mayur

 

> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>  Labels: pull-request-available
>
> The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
> number of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to an O(N) single pass algorithm 
> that computes coalesced ranges while making the first pass over the ranges in 
> the list. This algorithm is also shorter in lines of code and hence 
> (hopefully) more maintainable in long term.
> Correction: Post sorting, the current algorithm is O(N) and the improvement 
> is O(N).
>  





[jira] [Updated] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Mayur Srivastava (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayur Srivastava updated ARROW-8543:

Description: 
The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
number of ranges) (first implemented in 
https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
Coalesce() function finds the begin and end of a candidate range that can be 
coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
over the ranges from begin to end and adds coalesced range to the result (out).

The proposal is to convert the algorithm to an O(N) single pass algorithm that 
computes coalesced ranges while making the first pass over the ranges in the 
list. This algorithm is also shorter in lines of code and hence (hopefully) 
more maintainable in long term.

Correction: Post sorting, the current algorithm is O(N) and the improvement is 
O(N).

 

  was:
The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
number of ranges) (first implemented in 
https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
Coalesce() function finds the begin and end of a candidate range that can be 
coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
over the ranges from begin to end and adds coalesced range to the result (out).

The proposal is to convert the algorithm to an O(N) single pass algorithm that 
computes coalesced ranges while making the first pass over the ranges in the 
list. This algorithm is also shorter in lines of code and hence (hopefully) 
more maintainable in long term.

 


> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>  Labels: pull-request-available
>
> The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
> number of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to an O(N) single pass algorithm 
> that computes coalesced ranges while making the first pass over the ranges in 
> the list. This algorithm is also shorter in lines of code and hence 
> (hopefully) more maintainable in long term.
> Correction: Post sorting, the current algorithm is O(N) and the improvement 
> is O(N).
>  





[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Mayur Srivastava (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088959#comment-17088959
 ] 

Mayur Srivastava commented on ARROW-8543:
-

You are right.

I should correct myself. The correct statement is "change 2 pass algorithm 
(O(N)) with a constant 2 to 1 pass algorithm (O(N)) with a constant 1".

What do you think?

> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>  Labels: pull-request-available
>
> The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
> number of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to an O(N) single pass algorithm 
> that computes coalesced ranges while making the first pass over the ranges in 
> the list. This algorithm is also shorter in lines of code and hence 
> (hopefully) more maintainable in long term.
>  





[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088949#comment-17088949
 ] 

Antoine Pitrou commented on ARROW-8543:
---

That part is O(N). You missed the {{start = next}} line.

> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>  Labels: pull-request-available
>
> The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
> number of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to an O(N) single pass algorithm 
> that computes coalesced ranges while making the first pass over the ranges in 
> the list. This algorithm is also shorter in lines of code and hence 
> (hopefully) more maintainable in long term.
>  





[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Mayur Srivastava (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088946#comment-17088946
 ] 

Mayur Srivastava commented on ARROW-8543:
-

Hi [~apitrou],

Thanks for looking into it!

I agree on O(N log N) due to sorting. My comment was mainly about the 
post-sorting algorithm.

 

The algorithm is as follows (let me know if I'm making a mistake) (ref: 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.cc]):
{noformat}
start = ranges.begin(), prev = start, next = prev

while (++next != end) {
  if (isLargerThanHole(prev, next)) {
    if (next - start > 1) {
      CoalesceUntilLargeEnough(start, next, coalesced output)
    } else {
      // append start to coalesced output
    }
    start = next;
  }
  prev = next;
}

if (next - start > 1) {
  CoalesceUntilLargeEnough(start, next, coalesced output)
} else {
  // append start to coalesced output
}{noformat}
To simplify, let's assume there is no hole in the ranges.

Then, we increment 'next' till the end in the 'while' loop. This is the first 
pass iterating from start to end in Coalesce().

Then we call CoalesceUntilLargeEnough(), which will iterate from start to end and 
append coalesced ranges to the result. This is the second pass in 
CoalesceUntilLargeEnough().

This is the reason I called it a two pass algorithm.

 

My proposal ([https://github.com/apache/arrow/pull/7002]) is to change it to a 
single pass.

Other than single pass benefits, it does less copying and is shorter in number 
of lines of code.
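For illustration, the single-pass idea can be sketched in Python. This is a sketch only, not the Arrow implementation; the parameter names hole_size_limit and range_size_limit are assumptions standing in for the coalescing thresholds:

```python
def coalesce_ranges(ranges, hole_size_limit, range_size_limit):
    """Coalesce sorted, non-overlapping (offset, length) ranges in one pass.

    Two adjacent ranges are merged when the hole between them is at most
    hole_size_limit and the merged range stays within range_size_limit.
    """
    out = []
    cur_start, cur_len = ranges[0]
    for off, length in ranges[1:]:
        hole = off - (cur_start + cur_len)
        new_len = off + length - cur_start
        if hole <= hole_size_limit and new_len <= range_size_limit:
            cur_len = new_len  # extend the current coalesced range
        else:
            out.append((cur_start, cur_len))  # emit and start a new range
            cur_start, cur_len = off, length
    out.append((cur_start, cur_len))
    return out

print(coalesce_ranges([(0, 10), (12, 8), (100, 5)], 4, 1024))
# → [(0, 20), (100, 5)]
```

Each range is visited exactly once and coalesced output is appended as the scan proceeds, which is the "constant 1" single pass discussed above.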

 

Let me know what you think about it.

 

Thanks,

Mayur

> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>  Labels: pull-request-available
>
> The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
> number of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second, pass the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to an O(N) single pass algorithm 
> that computes coalesced ranges while making the first pass over the ranges in 
> the list. This algorithm is also shorter in lines of code and hence 
> (hopefully) more maintainable in long term.
>  





[jira] [Resolved] (ARROW-8544) [CI][Crossbow] Add a status.json to the gh-pages summary of nightly builds to get around rate limiting

2020-04-21 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-8544.

Resolution: Fixed

Issue resolved by pull request 6994
[https://github.com/apache/arrow/pull/6994]

> [CI][Crossbow] Add a status.json to the gh-pages summary of nightly builds to 
> get around rate limiting
> --
>
> Key: ARROW-8544
> URL: https://issues.apache.org/jira/browse/ARROW-8544
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Continuous Integration, Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> Crossbow already queries commit statuses for generating a static github page. 
> Use this mechanism to reduce the required github api calls on the future 
> dashboard by serializing the already queried API responses.





[jira] [Created] (ARROW-8544) [CI][Crossbow] Add a status.json to the gh-pages summary of nightly builds to get around rate limiting

2020-04-21 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8544:
--

 Summary: [CI][Crossbow] Add a status.json to the gh-pages summary 
of nightly builds to get around rate limiting
 Key: ARROW-8544
 URL: https://issues.apache.org/jira/browse/ARROW-8544
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Continuous Integration, Developer Tools
Reporter: Krisztian Szucs
Assignee: Ben Kietzman
 Fix For: 1.0.0


Crossbow already queries commit statuses for generating a static github page. 
Use this mechanism to reduce the required github api calls on the future 
dashboard by serializing the already queried API responses.





[jira] [Updated] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8543:
--
Labels: pull-request-available  (was: )

> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>  Labels: pull-request-available
>
> The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
> number of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to an O(N) single pass algorithm 
> that computes coalesced ranges while making the first pass over the ranges in 
> the list. This algorithm is also shorter in lines of code and hence 
> (hopefully) more maintainable in long term.
>  





[jira] [Commented] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088901#comment-17088901
 ] 

Antoine Pitrou commented on ARROW-8543:
---

Can you explain why the current algorithm is O(N^2)? CoalesceUntilLargeEnough() 
is called on disjoint subsets of the ranges, so each range is examined only 
once.

(however, the algorithm is O(N log N) due to sorting)


> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>
> The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
> number of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to an O(N) single pass algorithm 
> that computes coalesced ranges while making the first pass over the ranges in 
> the list. This algorithm is also shorter in lines of code and hence 
> (hopefully) more maintainable in long term.
>  





[jira] [Resolved] (ARROW-8529) [C++] Fix usage of NextCounts() in GetBatchWithDict[Spaced]

2020-04-21 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-8529.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 6991
[https://github.com/apache/arrow/pull/6991]

> [C++] Fix usage of NextCounts() in GetBatchWithDict[Spaced]
> ---
>
> Key: ARROW-8529
> URL: https://issues.apache.org/jira/browse/ARROW-8529
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> See discussion in ARROW-8486





[jira] [Updated] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Mayur Srivastava (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayur Srivastava updated ARROW-8543:

Description: 
The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
number of ranges) (first implemented in 
https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
Coalesce() function finds the begin and end of a candidate range that can be 
coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
over the ranges from begin to end and adds coalesced range to the result (out).

The proposal is to convert the algorithm to an O(N) single pass algorithm that 
computes coalesced ranges while making the first pass over the ranges in the 
list. This algorithm is also shorter in lines of code and hence (hopefully) 
more maintainable in long term.

 

  was:
The current coalescing algorithm is a O(n^2) two pass algorithm (where n is 
number of ranges) (first implemented in 
https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
Coalesce() function finds the begin and end of a candidate range that can be 
coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
over the ranges from begin to end and adds coalesced range to the result (out).

The proposal is to convert the algorithm to an O(n) single pass algorithm that 
computes coalesced ranges while making the first pass over the ranges in the 
list. This algorithm is also shorter in lines of code and hence (hopefully) 
more maintainable in long term.

 


> [C++] IO: single pass coalescing algorithm
> --
>
> Key: ARROW-8543
> URL: https://issues.apache.org/jira/browse/ARROW-8543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mayur Srivastava
>Priority: Major
>
> The current coalescing algorithm is a O(N^2) two pass algorithm (where N is 
> number of ranges) (first implemented in 
> https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
> Coalesce() function finds the begin and end of a candidate range that can be 
> coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
> over the ranges from begin to end and adds coalesced range to the result 
> (out).
> The proposal is to convert the algorithm to an O(N) single pass algorithm 
> that computes coalesced ranges while making the first pass over the ranges in 
> the list. This algorithm is also shorter in lines of code and hence 
> (hopefully) more maintainable in long term.
>  





[jira] [Created] (ARROW-8543) [C++] IO: single pass coalescing algorithm

2020-04-21 Thread Mayur Srivastava (Jira)
Mayur Srivastava created ARROW-8543:
---

 Summary: [C++] IO: single pass coalescing algorithm
 Key: ARROW-8543
 URL: https://issues.apache.org/jira/browse/ARROW-8543
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Mayur Srivastava


The current coalescing algorithm is a O(n^2) two pass algorithm (where n is 
number of ranges) (first implemented in 
https://issues.apache.org/jira/browse/ARROW-7995). In the first pass, the 
Coalesce() function finds the begin and end of a candidate range that can be 
coalesced. In the second pass, the CoalesceUntilLargeEnough() function goes 
over the ranges from begin to end and adds coalesced range to the result (out).

The proposal is to convert the algorithm to an O(n) single pass algorithm that 
computes coalesced ranges while making the first pass over the ranges in the 
list. This algorithm is also shorter in lines of code and hence (hopefully) 
more maintainable in long term.

 





[jira] [Resolved] (ARROW-2714) [C++/Python] Variable step size slicing for arrays

2020-04-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2714.
-
Resolution: Fixed

Issue resolved by pull request 6970
[https://github.com/apache/arrow/pull/6970]

> [C++/Python] Variable step size slicing for arrays
> --
>
> Key: ARROW-2714
> URL: https://issues.apache.org/jira/browse/ARROW-2714
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Florian Jetter
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Array slicing should support variable step sizes
> The current behavior raises an {{IndexError}}, e.g.
> {code}
> In [8]: import pyarrow as pa
> In [9]: pa.array([1, 2, 3])[::-1]
> ---
> IndexError Traceback (most recent call last)
>  in ()
> > 1 pa.array([1, 2, 3])[::-1]
> array.pxi in pyarrow.lib.Array.__getitem__()
> array.pxi in pyarrow.lib._normalize_slice()
> IndexError: only slices with step 1 supported
> {code}
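For reference, the requested behavior mirrors Python's built-in slice semantics. A pure-Python sketch (pyarrow not required; slice_values is an illustrative helper, not part of the pyarrow API) of what variable-step slicing should produce:

```python
def slice_values(values, start=None, stop=None, step=None):
    # Mirror Python list-slicing semantics, including negative steps,
    # which is what ARROW-2714 asks pyarrow arrays to support.
    return values[slice(start, stop, step)]

print(slice_values([1, 2, 3], step=-1))    # → [3, 2, 1]
print(slice_values([1, 2, 3, 4], step=2))  # → [1, 3]
```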





[jira] [Commented] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory

2020-04-21 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088727#comment-17088727
 ] 

Andy Grove commented on ARROW-8536:
---

[~d...@danburkert.com] I wonder if you could provide some guidance on this?

cc [~paddyhoran] [~nevime]

> [Rust] Failed to locate format/Flight.proto in any parent directory
> ---
>
> Key: ARROW-8536
> URL: https://issues.apache.org/jira/browse/ARROW-8536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Critical
> Fix For: 1.0.0
>
>
> When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
> error "Failed to locate format/Flight.proto in any parent directory". This is 
> caused by the custom build script in the arrow-flight crate, which expects to 
> find a "format/Flight.proto" file in a parent directory. This works when 
> building the crate from within the Arrow source tree, but unfortunately 
> doesn't work for the published crate, since the Flight.proto file was not 
> published as part of the crate.
> The workaround is to create a "format" directory in the root of your file 
> system (or at least at a higher level than where cargo is building code) and 
> place the Flight.proto file there (making sure to use the 0.17.0 version, 
> which can be found in the source release [1]).
>  [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0]
>  





[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory

2020-04-21 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-8536:
--
Description: 
When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
error "Failed to locate format/Flight.proto in any parent directory". This is 
caused by the custom build script in the arrow-flight crate, which expects to 
find a "format/Flight.proto" file in a parent directory. This works when 
building the crate from within the Arrow source tree, but unfortunately doesn't 
work for the published crate, since the Flight.proto file was not published as 
part of the crate.

The workaround is to create a "format" directory in the root of your file 
system (or at least at a higher level than where cargo is building code) and 
place the Flight.proto file there (making sure to use the 0.17.0 version, which 
can be found in the source release [1]).

 [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0]
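The workaround can be scripted roughly as follows. This is a sketch under assumptions: ARROW_SRC is a hypothetical variable pointing at an extracted 0.17.0 source release, and ../format (one level above the project) is just one possible location the build script's parent-directory search will find:

```shell
# Run from the root of the Rust project that depends on arrow-flight 0.17.0.
# ARROW_SRC is a hypothetical variable; adjust it to where you extracted
# the apache-arrow-0.17.0 source release.
ARROW_SRC=${ARROW_SRC:-$HOME/apache-arrow-0.17.0}

# The crate's build script walks parent directories looking for
# format/Flight.proto, so a copy one level above the project suffices.
mkdir -p ../format
cp "$ARROW_SRC/format/Flight.proto" ../format/ 2>/dev/null || \
  echo "place the 0.17.0 Flight.proto in ../format/ manually"
```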

 

  was:
When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
error "Failed to locate format/Flight.proto in any parent directory". This is 
caused by the custom build script in the arrow-flight crate, which expects to 
find a "format/Flight.proto" file in a parent directory. This works when 
building the crate from within the Arrow source tree, but unfortunately doesn't 
work for the published crate, since the Flight.proto file was not published as 
part of the crate.

The workaround is to create a top-level "format" directory in your Rust project 
and place the Flight.proto file there (making sure to use the 0.17.0 version, 
which can be found in the source release [1]).

 [1] https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0

 


> [Rust] Failed to locate format/Flight.proto in any parent directory
> ---
>
> Key: ARROW-8536
> URL: https://issues.apache.org/jira/browse/ARROW-8536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Critical
> Fix For: 1.0.0
>
>
> When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
> error "Failed to locate format/Flight.proto in any parent directory". This is 
> caused by the custom build script in the arrow-flight crate, which expects to 
> find a "format/Flight.proto" file in a parent directory. This works when 
> building the crate from within the Arrow source tree, but unfortunately 
> doesn't work for the published crate, since the Flight.proto file was not 
> published as part of the crate.
> The workaround is to create a "format" directory in the root of your file 
> system (or at least at a higher level than where cargo is building code) and 
> place the Flight.proto file there (making sure to use the 0.17.0 version, 
> which can be found in the source release [1]).
>  [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0]
>  





[jira] [Commented] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails

2020-04-21 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088678#comment-17088678
 ] 

Antoine Pitrou commented on ARROW-8539:
---

Thank you!

> [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
> --
>
> Key: ARROW-8539
> URL: https://issues.apache.org/jira/browse/ARROW-8539
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C, Continuous Integration, GLib
>Reporter: Antoine Pitrou
>Assignee: Yosuke Shiro
>Priority: Critical
>
> See e.g.
> https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868
> {code}
> [192/256] Generating arithmetic_ops.bc
> FAILED: src/gandiva/precompiled/arithmetic_ops.bc 
> cd 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled
>  && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env 
> SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
>  /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG 
> -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c 
> /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc
>  -o 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc
>  -isysroot 
> /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
>  -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src
> dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib
>   Referenced from: /usr/local/opt/llvm@8/bin/clang-8
>   Reason: image not found
> Child aborted
> {code}





[jira] [Commented] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails

2020-04-21 Thread Yosuke Shiro (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088674#comment-17088674
 ] 

Yosuke Shiro commented on ARROW-8539:
-

[https://github.com/Homebrew/homebrew-core/pull/53445] has been merged.

CI is green. 
[https://github.com/apache/arrow/pull/6991/checks?check_run_id=605238698]

> [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
> --
>
> Key: ARROW-8539
> URL: https://issues.apache.org/jira/browse/ARROW-8539
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C, Continuous Integration, GLib
>Reporter: Antoine Pitrou
>Assignee: Yosuke Shiro
>Priority: Critical
>
> See e.g.
> https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868
> {code}
> [192/256] Generating arithmetic_ops.bc
> FAILED: src/gandiva/precompiled/arithmetic_ops.bc 
> cd 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled
>  && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env 
> SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
>  /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG 
> -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c 
> /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc
>  -o 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc
>  -isysroot 
> /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
>  -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src
> dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib
>   Referenced from: /usr/local/opt/llvm@8/bin/clang-8
>   Reason: image not found
> Child aborted
> {code}





[jira] [Resolved] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails

2020-04-21 Thread Yosuke Shiro (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro resolved ARROW-8539.
-
Resolution: Resolved

> [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
> --
>
> Key: ARROW-8539
> URL: https://issues.apache.org/jira/browse/ARROW-8539
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C, Continuous Integration, GLib
>Reporter: Antoine Pitrou
>Assignee: Yosuke Shiro
>Priority: Critical
>
> See e.g.
> https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868
> {code}
> [192/256] Generating arithmetic_ops.bc
> FAILED: src/gandiva/precompiled/arithmetic_ops.bc 
> cd 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled
>  && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env 
> SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
>  /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG 
> -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c 
> /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc
>  -o 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc
>  -isysroot 
> /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
>  -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src
> dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib
>   Referenced from: /usr/local/opt/llvm@8/bin/clang-8
>   Reason: image not found
> Child aborted
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory

2020-04-21 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-8536:
-

Assignee: Andy Grove

> [Rust] Failed to locate format/Flight.proto in any parent directory
> ---
>
> Key: ARROW-8536
> URL: https://issues.apache.org/jira/browse/ARROW-8536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Critical
> Fix For: 1.0.0
>
>
> When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
> error "Failed to locate format/Flight.proto in any parent directory". This is 
> caused by the custom build script in the arrow-flight crate, which expects to 
> find a "format/Flight.proto" file in a parent directory. This works when 
> building the crate from within the Arrow source tree, but unfortunately 
> doesn't work for the published crate, since the Flight.proto file was not 
> published as part of the crate.
> The workaround is to create a top-level "format" directory in your Rust 
> project and place the Flight.proto file there (making sure to use the 0.17.0 
> version, which can be found in the source release [1]).
>  [1] https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0
>  
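A minimal sketch of applying the workaround described above (the source-release path below is illustrative, not from the issue):

```shell
# Create a top-level "format" directory in the Rust project so the
# arrow-flight build script can find Flight.proto in a parent directory.
mkdir -p format
# Copy the 0.17.0 Flight.proto from the extracted source release into it
# (the source path is a placeholder for wherever you unpacked the release).
cp /path/to/apache-arrow-0.17.0/format/Flight.proto format/ 2>/dev/null || true
ls format
```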



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory

2020-04-21 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-8536:
--
Description: 
When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
error "Failed to locate format/Flight.proto in any parent directory". This is 
caused by the custom build script in the arrow-flight crate, which expects to 
find a "format/Flight.proto" file in a parent directory. This works when 
building the crate from within the Arrow source tree, but unfortunately doesn't 
work for the published crate, since the Flight.proto file was not published as 
part of the crate.

The workaround is to create a top-level "format" directory in your Rust project 
and place the Flight.proto file there (making sure to use the 0.17.0 version, 
which can be found in the source release [1]).

 [1] https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0

 

  was:
When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
error "Failed to locate format/Flight.proto in any parent directory".

The workaround is to create a top-level "format" directory in your Rust project 
and place the Flight.proto file there (making sure to use the 0.17.0 version

 


> [Rust] Failed to locate format/Flight.proto in any parent directory
> ---
>
> Key: ARROW-8536
> URL: https://issues.apache.org/jira/browse/ARROW-8536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Andy Grove
>Priority: Critical
> Fix For: 1.0.0
>
>
> When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
> error "Failed to locate format/Flight.proto in any parent directory". This is 
> caused by the custom build script in the arrow-flight crate, which expects to 
> find a "format/Flight.proto" file in a parent directory. This works when 
> building the crate from within the Arrow source tree, but unfortunately 
> doesn't work for the published crate, since the Flight.proto file was not 
> published as part of the crate.
> The workaround is to create a top-level "format" directory in your Rust 
> project and place the Flight.proto file there (making sure to use the 0.17.0 
> version, which can be found in the source release [1]).
>  [1] https://github.com/apache/arrow/releases/tag/apache-arrow-0.17.0
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory

2020-04-21 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-8536:
--
Description: 
When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
error "Failed to locate format/Flight.proto in any parent directory".

The workaround is to create a top-level "format" directory in your Rust project 
and place the Flight.proto file there (making sure to use the 0.17.0 version

 

  was:
When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
error "Failed to locate format/Flight.proto in any parent directory".

The workaround is to create a directory `/format` in the root of your file 
system and place the Flight.proto file there.

 


> [Rust] Failed to locate format/Flight.proto in any parent directory
> ---
>
> Key: ARROW-8536
> URL: https://issues.apache.org/jira/browse/ARROW-8536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Andy Grove
>Priority: Critical
> Fix For: 1.0.0
>
>
> When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
> error "Failed to locate format/Flight.proto in any parent directory".
> The workaround is to create a top-level "format" directory in your Rust 
> project and place the Flight.proto file there (making sure to use the 0.17.0 
> version
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8065:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
> -
>
> Key: ARROW-8065
> URL: https://issues.apache.org/jira/browse/ARROW-8065
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, pull-request-available
>
> Currently: a fragment is a product of a scan; it is a lazy collection of scan 
> tasks corresponding to a data source which is logically singular (like a 
> single file, a single row group, ...). It would be more useful if instead a 
> fragment were the direct object of a scan; one scans a fragment (or a 
> collection of fragments):
>  # Remove {{ScanOptions}} from Fragment's properties and move it into 
> {{Fragment::Scan}} parameters.
>  # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an 
> overload to support predicate pushdown in FileSystemDataset and UnionDataset 
> {{Dataset::GetFragments(std::shared_ptr predicate)}}.
>  # Expose lazy accessor to Fragment::physical_schema()
>  # Consolidate ScanOptions and ScanContext
> This will lessen the cognitive dissonance between fragments and files since 
> fragments will no longer include references to scan properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8542) [Release] Fix checksum url in the website post release script

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8542:
--
Labels: pull-request-available  (was: )

> [Release] Fix checksum url in the website post release script
> -
>
> Key: ARROW-8542
> URL: https://issues.apache.org/jira/browse/ARROW-8542
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The issue was captured here 
> https://github.com/apache/arrow-site/pull/53#discussion_r411728907



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8542) [Release] Fix checksum url in the website post release script

2020-04-21 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8542:
--

 Summary: [Release] Fix checksum url in the website post release 
script
 Key: ARROW-8542
 URL: https://issues.apache.org/jira/browse/ARROW-8542
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


The issue was captured here 
https://github.com/apache/arrow-site/pull/53#discussion_r411728907



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8541) [Release] Don't remove previous source releases automatically

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8541:
--
Labels: pull-request-available  (was: )

> [Release] Don't remove previous source releases automatically
> -
>
> Key: ARROW-8541
> URL: https://issues.apache.org/jira/browse/ARROW-8541
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We should keep at least the last three source tarballs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8541) [Release] Don't remove previous source releases automatically

2020-04-21 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088606#comment-17088606
 ] 

Antoine Pitrou commented on ARROW-8541:
---

I see all source releases here: https://archive.apache.org/dist/arrow/

> [Release] Don't remove previous source releases automatically
> -
>
> Key: ARROW-8541
> URL: https://issues.apache.org/jira/browse/ARROW-8541
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 1.0.0
>
>
> We should keep at least the last three source tarballs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8541) [Release] Don't remove previous source releases automatically

2020-04-21 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8541:
--

 Summary: [Release] Don't remove previous source releases 
automatically
 Key: ARROW-8541
 URL: https://issues.apache.org/jira/browse/ARROW-8541
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


We should keep at least the last three source tarballs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory

2020-04-21 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8536:
---
Fix Version/s: 1.0.0

> [Rust] Failed to locate format/Flight.proto in any parent directory
> ---
>
> Key: ARROW-8536
> URL: https://issues.apache.org/jira/browse/ARROW-8536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Andy Grove
>Priority: Critical
> Fix For: 1.0.0
>
>
> When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
> error "Failed to locate format/Flight.proto in any parent directory".
> The workaround is to create a directory `/format` in the root of your file 
> system and place the Flight.proto file there.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8536) [Rust] Failed to locate format/Flight.proto in any parent directory

2020-04-21 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8536:
---
Priority: Critical  (was: Major)

> [Rust] Failed to locate format/Flight.proto in any parent directory
> ---
>
> Key: ARROW-8536
> URL: https://issues.apache.org/jira/browse/ARROW-8536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Andy Grove
>Priority: Critical
>
> When using Arrow 0.17.0 as a dependency, it is likely that you will get the 
> error "Failed to locate format/Flight.proto in any parent directory".
> The workaround is to create a directory `/format` in the root of your file 
> system and place the Flight.proto file there.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8540) [C++] Create memory allocation benchmark

2020-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8540:
--
Labels: pull-request-available  (was: )

> [C++] Create memory allocation benchmark
> 
>
> Key: ARROW-8540
> URL: https://issues.apache.org/jira/browse/ARROW-8540
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> To judge the overhead of repeated allocations and deallocations (e.g. for 
> temporary computation results).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8540) [C++] Create memory allocation benchmark

2020-04-21 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-8540:
-

Assignee: Antoine Pitrou

> [C++] Create memory allocation benchmark
> 
>
> Key: ARROW-8540
> URL: https://issues.apache.org/jira/browse/ARROW-8540
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>
> To judge the overhead of repeated allocations and deallocations (e.g. for 
> temporary computation results).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8540) [C++] Create memory allocation benchmark

2020-04-21 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-8540:
--
Description: To judge the overhead of repeated allocations and 
deallocations (e.g. for temporary computation results).  (was: To judge of 
overhead of repeated allocations and deallocations (e.g. for temporary 
computation results).)

> [C++] Create memory allocation benchmark
> 
>
> Key: ARROW-8540
> URL: https://issues.apache.org/jira/browse/ARROW-8540
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> To judge the overhead of repeated allocations and deallocations (e.g. for 
> temporary computation results).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8540) [C++] Create memory allocation benchmark

2020-04-21 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8540:
-

 Summary: [C++] Create memory allocation benchmark
 Key: ARROW-8540
 URL: https://issues.apache.org/jira/browse/ARROW-8540
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


To judge of overhead of repeated allocations and deallocations (e.g. for 
temporary computation results).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails

2020-04-21 Thread Yosuke Shiro (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088530#comment-17088530
 ] 

Yosuke Shiro commented on ARROW-8539:
-

This is caused by llvm@8 (Homebrew).

I sent https://github.com/Homebrew/homebrew-core/pull/53445. Once that PR is 
merged, I'll check whether CI passes.

> [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
> --
>
> Key: ARROW-8539
> URL: https://issues.apache.org/jira/browse/ARROW-8539
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C, Continuous Integration, GLib
>Reporter: Antoine Pitrou
>Priority: Critical
>
> See e.g.
> https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868
> {code}
> [192/256] Generating arithmetic_ops.bc
> FAILED: src/gandiva/precompiled/arithmetic_ops.bc 
> cd 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled
>  && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env 
> SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
>  /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG 
> -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c 
> /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc
>  -o 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc
>  -isysroot 
> /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
>  -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src
> dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib
>   Referenced from: /usr/local/opt/llvm@8/bin/clang-8
>   Reason: image not found
> Child aborted
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails

2020-04-21 Thread Yosuke Shiro (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro reassigned ARROW-8539:
---

Assignee: Yosuke Shiro

> [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
> --
>
> Key: ARROW-8539
> URL: https://issues.apache.org/jira/browse/ARROW-8539
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C, Continuous Integration, GLib
>Reporter: Antoine Pitrou
>Assignee: Yosuke Shiro
>Priority: Critical
>
> See e.g.
> https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868
> {code}
> [192/256] Generating arithmetic_ops.bc
> FAILED: src/gandiva/precompiled/arithmetic_ops.bc 
> cd 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled
>  && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env 
> SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
>  /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG 
> -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c 
> /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc
>  -o 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc
>  -isysroot 
> /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
>  -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src
> dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib
>   Referenced from: /usr/local/opt/llvm@8/bin/clang-8
>   Reason: image not found
> Child aborted
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails

2020-04-21 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088499#comment-17088499
 ] 

Antoine Pitrou commented on ARROW-8539:
---

cc [~kou]

> [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
> --
>
> Key: ARROW-8539
> URL: https://issues.apache.org/jira/browse/ARROW-8539
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C, Continuous Integration, GLib
>Reporter: Antoine Pitrou
>Priority: Critical
>
> See e.g.
> https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868
> {code}
> [192/256] Generating arithmetic_ops.bc
> FAILED: src/gandiva/precompiled/arithmetic_ops.bc 
> cd 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled
>  && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env 
> SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
>  /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG 
> -DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c 
> /Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc
>  -o 
> /Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc
>  -isysroot 
> /Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
>  -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src
> dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib
>   Referenced from: /usr/local/opt/llvm@8/bin/clang-8
>   Reason: image not found
> Child aborted
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8539) [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails

2020-04-21 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8539:
-

 Summary: [CI] "AMD64 MacOS 10.15 GLib & Ruby" fails
 Key: ARROW-8539
 URL: https://issues.apache.org/jira/browse/ARROW-8539
 Project: Apache Arrow
  Issue Type: Bug
  Components: C, Continuous Integration, GLib
Reporter: Antoine Pitrou


See e.g.
https://github.com/apache/arrow/pull/6991/checks?check_run_id=604703868

{code}
[192/256] Generating arithmetic_ops.bc
FAILED: src/gandiva/precompiled/arithmetic_ops.bc 
cd 
/Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled
 && /usr/local/Cellar/cmake/3.17.1/bin/cmake -E env 
SDKROOT=/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
 /usr/local/opt/llvm@8/bin/clang-8 -std=c++11 -DGANDIVA_IR -DNDEBUG 
-DARROW_STATIC -DGANDIVA_STATIC -fno-use-cxa-atexit -emit-llvm -O3 -c 
/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src/gandiva/precompiled/arithmetic_ops.cc
 -o 
/Users/runner/runners/2.169.0/work/arrow/arrow/build/cpp/src/gandiva/precompiled/arithmetic_ops.bc
 -isysroot 
/Applications/Xcode_11.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
 -I/Users/runner/runners/2.169.0/work/arrow/arrow/cpp/src
dyld: Library not loaded: /usr/local/opt/z3/lib/libz3.dylib
  Referenced from: /usr/local/opt/llvm@8/bin/clang-8
  Reason: image not found
Child aborted
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8534) [C++][CSV] Issue building CSV component under GCC 6.1.0

2020-04-21 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088410#comment-17088410
 ] 

Antoine Pitrou commented on ARROW-8534:
---

Could you please open a PR?

> [C++][CSV] Issue building CSV component under GCC 6.1.0
> ---
>
> Key: ARROW-8534
> URL: https://issues.apache.org/jira/browse/ARROW-8534
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.0
> Environment: Centos 7 x86_64
>Reporter: Ross Wolfson
>Priority: Minor
>
> Hi,
> In the current version (0.17.0), it seems that CSV reader.cc fails to compile 
> when using GCC 6.1.0. This builds when using older or newer GCC versions (we 
> tested with 4.8.5, 8.2.0 and 9.3.0).
>  
> {{[root@1d4fcfc2580e arrow_src]# /ourgcc/gcc-6.1.0/bin/g++ -c 
> cpp/src/arrow/csv/reader.cc -I cpp/src}}
> {{cpp/src/arrow/csv/reader.cc: In constructor 
> 'arrow::csv::SerialBlockReader::SerialBlockReader(std::unique_ptr,
>  arrow::Iterator >, 
> std::shared_ptr)':}}
> {{cpp/src/arrow/csv/reader.cc:178:22: error: use of deleted function 
> 'std::unique_ptr<_Tp, _Dp>::unique_ptr(const std::unique_ptr<_Tp, _Dp>&) 
> [with _Tp = arrow::Chunker; _Dp = std::default_delete]'}}
> {{ using BlockReader::BlockReader;}}
> {{ ^~~}}
> {{In file included from /ourgcc/gcc-6.1.0/include/c++/6.1.0/memory:81:0,}}
> {{ from cpp/src/arrow/csv/reader.h:20,}}
> {{ from cpp/src/arrow/csv/reader.cc:18:}}
> {{/ourgcc/gcc-6.1.0/include/c++/6.1.0/bits/unique_ptr.h:356:7: note: declared 
> here}}
> {{ unique_ptr(const unique_ptr&) = delete;}}
> {{ ^~}}
> {{cpp/src/arrow/csv/reader.cc:178:22: error: use of deleted function 
> 'arrow::Iterator >::Iterator(const 
> arrow::Iterator >&)'}}
> {{ using BlockReader::BlockReader;}}
> {{ ^~~}}
> {{In file included from cpp/src/arrow/csv/reader.cc:43:0:}}
> {{cpp/src/arrow/util/iterator.h:63:7: note: 
> 'arrow::Iterator >::Iterator(const 
> arrow::Iterator >&)' is implicitly deleted 
> because the default definition would be ill-formed:}}
> {{ class Iterator : public util::EqualityComparable> {}}
> {{ ^~~~}}
> {{cpp/src/arrow/util/iterator.h:63:7: error: use of deleted function 
> 'std::unique_ptr<_Tp, _Dp>::unique_ptr(const std::unique_ptr<_Tp, _Dp>&) 
> [with _Tp = void; _Dp = void (*)(void*)]'}}
> {{In file included from /ourgcc/gcc-6.1.0/include/c++/6.1.0/memory:81:0,}}
> {{ from cpp/src/arrow/csv/reader.h:20,}}
> {{ from cpp/src/arrow/csv/reader.cc:18:}}
> {{/ourgcc/gcc-6.1.0/include/c++/6.1.0/bits/unique_ptr.h:356:7: note: declared 
> here}}
> {{ unique_ptr(const unique_ptr&) = delete;}}
> {{ ^~}}
> {{cpp/src/arrow/csv/reader.cc: In member function 'virtual 
> arrow::Result > 
> arrow::csv::SerialTableReader::Read()':}}
> {{cpp/src/arrow/csv/reader.cc:750:88: note: synthesized method 
> 'arrow::csv::SerialBlockReader::SerialBlockReader(std::unique_ptr,
>  arrow::Iterator >, 
> std::shared_ptr)' first required here}}
> {{ std::move(buffer_iterator_), std::move(first_buffer));}}
> {{ ^}}
> {{cpp/src/arrow/csv/reader.cc: In constructor 
> 'arrow::csv::ThreadedBlockReader::ThreadedBlockReader(std::unique_ptr,
>  arrow::Iterator >, 
> std::shared_ptr)':}}
> {{cpp/src/arrow/csv/reader.cc:221:22: error: use of deleted function 
> 'std::unique_ptr<_Tp, _Dp>::unique_ptr(const std::unique_ptr<_Tp, _Dp>&) 
> [with _Tp = arrow::Chunker; _Dp = std::default_delete]'}}
> {{ using BlockReader::BlockReader;}}
> {{ ^~~}}
> {{In file included from /ourgcc/gcc-6.1.0/include/c++/6.1.0/memory:81:0,}}
> {{ from cpp/src/arrow/csv/reader.h:20,}}
> {{ from cpp/src/arrow/csv/reader.cc:18:}}
> {{/ourgcc/gcc-6.1.0/include/c++/6.1.0/bits/unique_ptr.h:356:7: note: declared 
> here}}
> {{ unique_ptr(const unique_ptr&) = delete;}}
> {{ ^~}}
> {{cpp/src/arrow/csv/reader.cc:221:22: error: use of deleted function 
> 'arrow::Iterator >::Iterator(const 
> arrow::Iterator >&)'}}
> {{ using BlockReader::BlockReader;}}
> {{ ^~~}}
> {{cpp/src/arrow/csv/reader.cc: In member function 'virtual 
> arrow::Result > 
> arrow::csv::ThreadedTableReader::Read()':}}
> {{cpp/src/arrow/csv/reader.cc:815:61: note: synthesized method 
> 'arrow::csv::ThreadedBlockReader::ThreadedBlockReader(std::unique_ptr,
>  arrow::Iterator >, 
> std::shared_ptr)' first required here}}
> {{ std::move(first_buffer));}}
> {{ ^}}
>  
> My colleague found a workaround that avoids the build error; however, we are 
> not sure whether this is the best fix.
> {{--- a/cpp/src/arrow/csv/reader.cc}}
> {{+++ b/cpp/src/arrow/csv/reader.cc}}
> {{@@ -175,7 +175,12 @@ class BlockReader {}}
> {{ // using CSVBlock::consume_bytes.}}
> {{ class SerialBlockReader : public BlockReader {}}
> {{ public:}}
> {{- using BlockReader::BlockReader;}}
> {{+ SerialBlockReader(std::unique_ptr chunker,}}
> {{+ Iterator> 

[jira] [Commented] (ARROW-6976) Possible memory leak in pyarrow read_parquet

2020-04-21 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088344#comment-17088344
 ] 

Joris Van den Bossche commented on ARROW-6976:
--

[~Athlete_369] that can happen, depending on your file. Parquet can be highly 
compressed, so there can be a big difference between the file size on disk and 
the in-memory size in pandas. You can check the memory usage of your resulting 
pandas DataFrame with {{df.info(memory_usage="deep")}}. How much does that 
indicate?

> Possible memory leak in pyarrow read_parquet
> 
>
> Key: ARROW-6976
> URL: https://issues.apache.org/jira/browse/ARROW-6976
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: linux ubuntu 18.04
>Reporter: david cottrell
>Priority: Critical
> Attachments: image-2019-10-23-16-17-20-739.png, pyarrow-master.png, 
> pyarrow_0150.png
>
>
>  
> Version and repro info in the gist below.
> Not sure if I'm not understanding something from this 
> [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/]
> but there seems to be memory accumulation when that is exacerbated with 
> higher arity objects like strings and dates (not datetimes).
>  
> I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed 
> to "fix" or lessen the problem.
>  
> [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62]
>  
> Let me know if this post should go elsewhere.
> !image-2019-10-23-16-17-20-739.png!
>  
> {code:java}
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8537) [C++] Performance regression from ARROW-8523

2020-04-21 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088338#comment-17088338
 ] 

Yibo Cai commented on ARROW-8537:
-

Add analysis: https://github.com/apache/arrow/pull/6986#issuecomment-616978252

> [C++] Performance regression from ARROW-8523
> 
>
> Key: ARROW-8537
> URL: https://issues.apache.org/jira/browse/ARROW-8537
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Priority: Major
>
> I optimized BitmapReader in [this 
> PR|https://github.com/apache/arrow/pull/6986] and saw a performance uplift in 
> the BitmapReader test case. I didn't check the other test cases at the time, 
> as the change looked trivial.
> I reviewed all the test cases just now and see a big performance drop in 4 
> cases, details at [PR 
> link|https://github.com/apache/arrow/pull/6986#issuecomment-616915079].
> I also compared the performance of code using BitmapReader and found no 
> obvious changes. It looks like we should revert that PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)