[jira] [Commented] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread Benson Muite (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421945#comment-17421945
 ] 

Benson Muite commented on ARROW-14152:
--

devtoolset-3 reached end of life 5 years ago: 
[https://www.softwarecollections.org/en/scls/rhscl/devtoolset-3/]. Thanks for 
the ticket: https://issues.apache.org/jira/browse/ARROW-14160

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
> Attachments: CentOS7-gcc4.8.5-CMakeCache.txt, 
> CentOS7-gcc4.8.5-makelog.out
>
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-14157) [C++] Refactor Abseil build in ThirdpartyToolchain

2021-09-28 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-14157.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11261
[https://github.com/apache/arrow/pull/11261]

> [C++] Refactor Abseil build in ThirdpartyToolchain
> --
>
> Key: ARROW-14157
> URL: https://issues.apache.org/jira/browse/ARROW-14157
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Both google-cloud-cpp and gRPC depend on Abseil.  We need to refactor the 
> Abseil build to its own macro so it can be more easily reused.





[jira] [Created] (ARROW-14164) [C++][Dataset] Enhance dataset writer to allow file-per-batch

2021-09-28 Thread Weston Pace (Jira)
Weston Pace created ARROW-14164:
---

 Summary: [C++][Dataset] Enhance dataset writer to allow 
file-per-batch
 Key: ARROW-14164
 URL: https://issues.apache.org/jira/browse/ARROW-14164
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


The dataset writer currently groups incoming batches into large files whose 
size is controlled by max_rows_per_file.  In the PR for this work [~jorisvandenbossche] 
recommended an option where each incoming batch creates a new file.

This would give the user fine-grained control over how many files are created 
(provided they are doing a very basic scan/filter/project and not using any 
more sophisticated nodes which may resize batches).
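
To make the two modes concrete, here is a minimal sketch in Python. It is 
not Arrow's dataset writer: plan_files and file_per_batch are hypothetical 
names, and the sketch assumes batches are never split and that a file is 
closed as soon as it reaches max_rows_per_file.

```python
from typing import Iterable, List


def plan_files(batch_sizes: Iterable[int], max_rows_per_file: int = 0,
               file_per_batch: bool = False) -> List[List[int]]:
    """Return, for each output file, the row counts of the batches it holds.

    Hypothetical planner for illustration only. A max_rows_per_file of 0
    means "no limit".
    """
    files: List[List[int]] = []
    current: List[int] = []
    rows = 0
    for size in batch_sizes:
        if file_per_batch:
            files.append([size])  # one incoming batch -> one file
            continue
        current.append(size)
        rows += size
        if max_rows_per_file and rows >= max_rows_per_file:
            files.append(current)  # close the file once the limit is reached
            current, rows = [], 0
    if current:
        files.append(current)  # flush the last, partially filled file
    return files
```

With file_per_batch=True the number of files equals the number of incoming 
batches, which is why any upstream node that resizes batches would change 
the result.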





[jira] [Created] (ARROW-14163) [C++] Naive spillover implementation for join

2021-09-28 Thread Weston Pace (Jira)
Weston Pace created ARROW-14163:
---

 Summary: [C++] Naive spillover implementation for join
 Key: ARROW-14163
 URL: https://issues.apache.org/jira/browse/ARROW-14163
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace


A join is a pipeline breaker.  I believe the proposed join operators assume 
that the data can fit into memory and queue all incoming batches.  For example, 
if I understand correctly, https://github.com/apache/arrow/pull/11150 queues 
the right side until the left side has finished.

There are many clever and interesting ways that this can be optimized  (divide 
& conquer, recursive query, prioritize reading the left side and pause the 
right side read).  This issue is intentionally not clever or interesting.

Instead, I think it would be good to take advantage of this opportunity to 
start fleshing out our spillover capabilities.  A very simplistic 
implementation could be a standalone node which has 2 inputs and 2 outputs.  
The node queues up all incoming data on the "right" input and lets the "left" 
input pass through.  Then, when the left input has finished the node will 
release the right input.

This node could then implement a basic spillover mechanism (e.g. IPC to disk) 
and start to flesh out the abstractions that we will eventually want to handle 
different spillover strategies  (abort on spill, spill to disk, and spill to s3 
are all I can think of at the moment).
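
A rough sketch of the node described above, in Python for brevity. 
Everything here is an assumption made for illustration: SpillingQueue and 
hold_right_until_left_done are hypothetical names, pickle stands in for 
Arrow IPC, the memory limit counts batches rather than bytes, and the two 
inputs are consumed sequentially instead of concurrently.

```python
import pickle
import tempfile
from typing import Any, Iterable, Iterator, List, Tuple


class SpillingQueue:
    """Queue batches in memory and spill to a temp file past a limit."""

    def __init__(self, max_in_memory: int) -> None:
        self.max_in_memory = max_in_memory
        self.memory: List[Any] = []
        self.spill_file = None
        self.spilled = 0

    def push(self, batch: Any) -> None:
        if len(self.memory) < self.max_in_memory:
            self.memory.append(batch)
            return
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        pickle.dump(batch, self.spill_file)  # stand-in for IPC to disk
        self.spilled += 1

    def drain(self) -> Iterator[Any]:
        # In-memory batches arrived first, so they are released first.
        yield from self.memory
        self.memory = []
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled):
                yield pickle.load(self.spill_file)
            self.spill_file.close()
            self.spill_file = None
            self.spilled = 0


def hold_right_until_left_done(left: Iterable[Any], right: Iterable[Any],
                               max_in_memory: int = 2
                               ) -> Iterator[Tuple[str, Any]]:
    """Pass left batches through, then release the queued right batches."""
    queue = SpillingQueue(max_in_memory)
    for batch in right:  # a real engine would receive both sides concurrently
        queue.push(batch)
    for batch in left:
        yield ("left", batch)
    for batch in queue.drain():
        yield ("right", batch)
```

Other strategies (abort on spill, spill to s3) would replace the body of 
push and drain, which is the abstraction boundary this issue is meant to 
explore.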





[jira] [Updated] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files

2021-09-28 Thread Rares Vernica (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rares Vernica updated ARROW-14161:
--
Description: 
Missing documentation on Reading/Writing Parquet files C++ api:
 * 
[WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
 has no docs on chunk_size; some can be found 
[here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
 _size of the RowGroup in the parquet file. Normally you would choose this to 
be rather large_
 * Typo in the file reader 
[example|https://arrow.apache.org/docs/cpp/parquet.html#filereader]: the 
include should be {{#include "parquet/arrow/reader.h"}}

  was:
Missing documentation on Reading/Writing Parquet files C++ api:
 * 
[WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
 missing docs on chunk_size found some 
[here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
 _size of the RowGroup in the parquet file. Normally you would choose this to 
be rather large_


> [C++][Parquet][Docs] Reading/Writing Parquet Files
> --
>
> Key: ARROW-14161
> URL: https://issues.apache.org/jira/browse/ARROW-14161
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Rares Vernica
>Priority: Minor
>
> Missing documentation on Reading/Writing Parquet files C++ api:
>  * 
> [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
>  missing docs on chunk_size found some 
> [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
>  _size of the RowGroup in the parquet file. Normally you would choose this to 
> be rather large_
>  * Typo in file reader 
> [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader]  the 
> include should be {{#include "parquet/arrow/reader.h"}}





[jira] [Commented] (ARROW-14162) [R] Simple arrange %>% head does not respect ordering

2021-09-28 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421912#comment-17421912
 ] 

Weston Pace commented on ARROW-14162:
-

The call to `head` is triggering an (immediate?) call to the legacy scanner 
head method.  The resulting dataset is then returned.  Then the remaining dplyr 
execution is resolved against the in-memory data.  ExecPlan is not used at all. 
 So it is first fetching the first 4 rows and then sorting instead of sorting 
and then fetching.

If this is truly a blocker for 6.0.0 then it might be a problem.  The head 
can't be applied in R because it would read in all of the data (presumably you 
could abort the read partway through but I think this would be overly complex).

If we want to do a proper ordered head in C++ then my recommendation would be 
the batch index scheme proposed in the sequencing doc 
[here](https://docs.google.com/document/d/1MfVE9td9D4n5y-PTn66kk4-9xG7feXs1zSFf-qxQgPs/edit?usp=sharing)
 but I'm not sure we want to tackle that as part of 6.0.0.

As a short-term solution we can modify the sorting sink node to accept a limit 
argument.  That should be a reasonably quick solution and could maybe fit in 
6.0.0 but I'm not sure how much time we want to invest in stop-gap measures.
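
To illustrate the difference, here is a small Python sketch. It is not the 
ExecPlan or the R bindings: head_then_sort mimics the behavior described 
above (take the first rows, then sort), and sort_sink_with_limit is a 
hypothetical stand-in for a sorting sink that accepts a limit.

```python
import heapq
from typing import Dict, Iterable, List

Row = Dict[str, float]


def head_then_sort(batches: Iterable[List[Row]], key: str, n: int) -> List[Row]:
    """Take the first n rows as they arrive, then sort only those rows."""
    rows = [row for batch in batches for row in batch][:n]
    return sorted(rows, key=lambda r: r[key])


def sort_sink_with_limit(batches: Iterable[List[Row]], key: str, n: int) -> List[Row]:
    """Consume every batch but keep only the n smallest rows by key."""
    rows = (row for batch in batches for row in batch)
    return heapq.nsmallest(n, rows, key=lambda r: r[key])


# mpg values taken from the report: the first four rows of mtcars, followed
# by the four rows that should have been returned.
batches = [
    [{"mpg": 21.0}, {"mpg": 21.0}],
    [{"mpg": 22.8}, {"mpg": 21.4}],
    [{"mpg": 10.4}, {"mpg": 10.4}],
    [{"mpg": 13.3}, {"mpg": 14.3}],
]
```

The limited sink still has to read all of the data, but it never needs to 
hold more than n rows beyond the batch it is currently processing.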

> [R] Simple arrange %>% head does not respect ordering
> -
>
> Key: ARROW-14162
> URL: https://issues.apache.org/jira/browse/ARROW-14162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Weston Pace
>Priority: Blocker
>
> This was originally reported by [~jonkeane] in ARROW-13893 but that issue was 
> covering a different topic so I am opening a new issue for this specific 
> behavior.
> {code:r}
> > library(arrow)
> > library(dplyr)
> > 
> > tab <- Table$create(mtcars)
> > 
> > tab %>% 
> +   arrange(mpg) %>% 
> +   head(4) %>% 
> +   collect()
>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
> > 
> > mtcars %>% 
> +   arrange(mpg) %>% 
> +   head(4) %>% 
> +   collect()
>                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
> Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
> Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
> Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
> Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
> {code}





[jira] [Created] (ARROW-14162) [R] Simple arrange %>% head does not respect ordering

2021-09-28 Thread Weston Pace (Jira)
Weston Pace created ARROW-14162:
---

 Summary: [R] Simple arrange %>% head does not respect ordering
 Key: ARROW-14162
 URL: https://issues.apache.org/jira/browse/ARROW-14162
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Weston Pace


This was originally reported by [~jonkeane] in ARROW-13893 but that issue was 
covering a different topic so I am opening a new issue for this specific 
behavior.

{code:r}
> library(arrow)
> library(dplyr)
> 
> tab <- Table$create(mtcars)
> 
> tab %>% 
+   arrange(mpg) %>% 
+   head(4) %>% 
+   collect()
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
> 
> mtcars %>% 
+   arrange(mpg) %>% 
+   head(4) %>% 
+   collect()
                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
{code}






[jira] [Updated] (ARROW-12846) [Release] Improve upload of binaries

2021-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12846:
---
Labels: pull-request-available  (was: )

> [Release] Improve upload of binaries
> 
>
> Key: ARROW-12846
> URL: https://issues.apache.org/jira/browse/ARROW-12846
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Jorge Leitão
>Assignee: Kouhei Sutou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Running  dev/release/05-binary-upload.sh takes a long time and is prone to 
> network failures, etc. When it fails, it needs to be started from scratch.
> IMO we could alleviate this. An idea here would be to run the script in the 
> same order of the configuration variables that it has (e.g. 
> `UPLOAD_AMAZON_LINUX_RPM`) and echo the variable when binaries corresponding 
> to that section are complete.
> This way, when something fails, as a user I can pass e.g. 
> `UPLOAD_AMAZON_LINUX_RPM=0` and skip the parts that were already uploaded.
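
The idea can be sketched as follows. This is Python rather than the shell 
of the actual script, run_sections is a hypothetical helper, and only 
UPLOAD_AMAZON_LINUX_RPM is a variable named in the issue; any other flag 
passed in is made up by the caller.

```python
import os
from typing import Callable, Dict, List, Mapping


def run_sections(sections: Dict[str, Callable[[], None]],
                 env: Mapping[str, str] = os.environ) -> List[str]:
    """Run each upload section in order unless its flag is set to "0".

    The flag name is echoed after a section completes, so that after a
    failure the user knows which flags to set to 0 on the next run.
    """
    completed: List[str] = []
    for flag, upload in sections.items():  # dicts keep insertion order
        if env.get(flag, "1") == "0":
            continue  # already uploaded in an earlier run
        upload()
        print(f"{flag} complete")
        completed.append(flag)
    return completed
```

If a section raises, the flags echoed so far are exactly the ones to pass 
as 0 when the script is restarted.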





[jira] [Updated] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files

2021-09-28 Thread Rares Vernica (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rares Vernica updated ARROW-14161:
--
Priority: Minor  (was: Major)

> [C++][Parquet][Docs] Reading/Writing Parquet Files
> --
>
> Key: ARROW-14161
> URL: https://issues.apache.org/jira/browse/ARROW-14161
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Rares Vernica
>Priority: Minor
>
> Missing documentation on Reading/Writing Parquet files C++ api:
>  * 
> [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
>  missing docs on chunk_size found some 
> [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
>  _size of the RowGroup in the parquet file. Normally you would choose this to 
> be rather large_





[jira] [Created] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files

2021-09-28 Thread Rares Vernica (Jira)
Rares Vernica created ARROW-14161:
-

 Summary: [C++][Parquet][Docs] Reading/Writing Parquet Files
 Key: ARROW-14161
 URL: https://issues.apache.org/jira/browse/ARROW-14161
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Rares Vernica


Missing documentation on Reading/Writing Parquet files C++ api:
 * 
[WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
 has no docs on chunk_size; some can be found 
[here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
 _size of the RowGroup in the parquet file. Normally you would choose this to 
be rather large_
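
As a rough illustration of what the quoted comment implies, assuming 
chunk_size is the maximum number of rows per row group, the row group 
sizes could be computed as below. row_group_sizes is a hypothetical helper 
in Python, not part of the Parquet C++ API.

```python
from typing import List


def row_group_sizes(num_rows: int, chunk_size: int) -> List[int]:
    """Split num_rows into row groups of at most chunk_size rows each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [min(chunk_size, num_rows - start)
            for start in range(0, num_rows, chunk_size)]
```

A small chunk_size therefore produces many small row groups, which is 
presumably why the comment recommends a rather large value.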





[jira] [Commented] (ARROW-14122) [C++] interval comparison kernels

2021-09-28 Thread QP Hou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421853#comment-17421853
 ] 

QP Hou commented on ARROW-14122:


> So then, if I understand correctly, the point on hashing comes down to 
> whether or not the cast from Arrow Interval to Postgres Interval is a 
> zero-copy metadata only cast or the bytes need to be mutated for consistent 
> hashing.

Yes, regardless of how it is cast to the postgres interval type, normalization 
needs to be applied to the postgres interval type before we can perform hashing 
and comparison operations on it.

I feel like we can leave it to individual compute engine implementations to 
decide how they want to perform the cast. They could choose to do zero-copy 
metadata cast or use a reference point in time to take leap seconds/days/months 
into account depending on the contract they provide to the users.
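
A minimal Python sketch of why normalization matters for hashing. The 
convention of 30 days per month and 24 hours per day is an assumption 
here (it is my understanding of how PostgreSQL compares intervals), and 
Interval and normalized_key are hypothetical names.

```python
from typing import NamedTuple

MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000
DAYS_PER_MONTH = 30  # assumed convention, see the note above


class Interval(NamedTuple):
    months: int
    days: int
    micros: int


def normalized_key(iv: Interval) -> int:
    """Collapse the three fields into a single number of microseconds."""
    return (iv.months * DAYS_PER_MONTH + iv.days) * MICROS_PER_DAY + iv.micros


one_month = Interval(months=1, days=0, micros=0)
thirty_days = Interval(months=0, days=30, micros=0)
```

The raw fields of one_month and thirty_days differ, so hashing the bytes 
directly would put them in different buckets, while hashing 
normalized_key puts them in the same one.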

> [C++] interval comparison kernels
> -
>
> Key: ARROW-14122
> URL: https://issues.apache.org/jira/browse/ARROW-14122
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Phillip Cloud
>Priority: Major
>  Labels: kernel
>
> Subtask for tracking interval comparison kernels





[jira] [Resolved] (ARROW-14109) Segfault When Reading JSON With Duplicate Keys

2021-09-28 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai resolved ARROW-14109.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11222
[https://github.com/apache/arrow/pull/11222]

> Segfault When Reading JSON With Duplicate Keys
> --
>
> Key: ARROW-14109
> URL: https://issues.apache.org/jira/browse/ARROW-14109
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: William Butler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When arrow attempts to parse JSON with duplicate keys and no explicit schema 
> is provided, an out of buffer read is performed and Arrow can crash. 
> Reproducing json:
> {code:java}
> {"a":0, "a":1}{code}
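
For reference, the reproducing input can be checked for duplicate keys 
with a few lines of Python. This only illustrates the shape of the input; 
it says nothing about how the fix in Arrow's C++ reader works, and 
reject_duplicate_keys and parse_strict are hypothetical names.

```python
import json
from typing import Any, Dict, List, Tuple


def reject_duplicate_keys(pairs: List[Tuple[str, Any]]) -> Dict[str, Any]:
    """Build a dict from key/value pairs, failing on a repeated key."""
    seen: Dict[str, Any] = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def parse_strict(text: str) -> Any:
    """Parse JSON, raising ValueError if any object repeats a key."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)
```

Python's default json.loads silently keeps the last value for a repeated 
key, so the hook is needed to surface the duplicate.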





[jira] [Commented] (ARROW-14122) [C++] interval comparison kernels

2021-09-28 Thread QP Hou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421849#comment-17421849
 ] 

QP Hou commented on ARROW-14122:


Awesome, looks like we have a consensus here :) +1 on making the arrow interval 
type unordered and introducing a totally ordered postgres interval type for 
compatibility purposes.

> QP Hou, I think that the `PartialOrd` in Rust is not the same as partial 
> order I referenced above. `PartialOrd` may return None when two types are not 
> comparable, whereas the partial order above must always return true or false. 
> However, even that is not what we want for compatibility with postgres.

I still think the PartialOrd trait semantics match the partial order 
definition in the wiki page you linked, specifically the non-strict partial order 
definition. The only difference between a non-strict partial order and a total 
order is the absence of the `strongly connected` rule. Lack of strong connectivity 
in plain English means an order cannot be defined for all pairs of elements 
from the set, which is exactly the semantics we have for the current arrow 
interval type. The antisymmetry rule starts with `if a<=b and b <=a`. The 
leading `if` means it is only applicable for a, b where an order can be 
defined between them. The strong connectivity rule `a <= b or a >= b` is what 
requires an order to be defined for every pair of elements from the set.

However, I don't think this changes our conclusion :) As I mentioned in my 
first comment and [~cpcloud] mentioned in his second comment, a correctly 
defined partial order for the arrow interval type can be very confusing to our 
end users. Partial order is useful for floating-point numbers because it is only 
NaN that cannot have an order defined, so the set of unordered elements is very 
sparse. For arrow's interval type, by contrast, the set of unordered elements is very 
dense. This makes the partial order relation not very useful in practice. Most 
of the time you will still need to convert it to a timestamp or duration with a 
reference point. So we might as well not define an order for it.
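
To show how dense the unordered set is, here is one possible partial 
order sketched in Python. The component-wise definition is an assumption 
chosen for illustration, not something proposed in this thread, and 
Interval and partial_cmp are hypothetical names; None plays the role of 
Rust's `PartialOrd` returning None.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    months: int
    days: int
    nanos: int


def partial_cmp(a: Interval, b: Interval) -> Optional[int]:
    """Return -1, 0 or 1 when a and b are comparable, otherwise None.

    a <= b only if every field of a is <= the matching field of b.
    """
    diffs = [x - y for x, y in zip(a, b)]
    if all(d == 0 for d in diffs):
        return 0
    if all(d <= 0 for d in diffs):
        return -1
    if all(d >= 0 for d in diffs):
        return 1
    return None  # e.g. more months but fewer days: no order without a reference point
```

Under this definition 1 month and 30 days are incomparable, as is any 
pair where one field is larger and another is smaller, which covers most 
pairs a user would actually want to compare.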

> Do we need a new JIRA/set of JIRAs to track the work of adding the extension 
> type?

I think this is best tracked as a separate ticket that links to this one.

> [C++] interval comparison kernels
> -
>
> Key: ARROW-14122
> URL: https://issues.apache.org/jira/browse/ARROW-14122
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Phillip Cloud
>Priority: Major
>  Labels: kernel
>
> Subtask for tracking interval comparison kernels





[jira] [Created] (ARROW-14160) [C++][Build] Cannot build with Parquet/Thrift support on CentOS 7

2021-09-28 Thread Rares Vernica (Jira)
Rares Vernica created ARROW-14160:
-

 Summary: [C++][Build] Cannot build with Parquet/Thrift support on 
CentOS 7
 Key: ARROW-14160
 URL: https://issues.apache.org/jira/browse/ARROW-14160
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Parquet
Affects Versions: 3.0.0
Reporter: Rares Vernica


Cannot compile the C++ library for Arrow 3.0.0 in CentOS 7. It breaks if I set 
{{ARROW_PARQUET=ON}}. It stops while trying to build {{thrift_ep}}

 
{code:java}
> scl enable devtoolset-3 "cmake3 .. \
 -DARROW_PARQUET=ON  \
 -DARROW_WITH_LZ4=ON \
 -DARROW_WITH_ZLIB=ON \
 -DARROW_COMPUTE=ON \
 -DCMAKE_CXX_COMPILER=/opt/rh/devtoolset-3/root/usr/bin/g++ \
 -DCMAKE_C_COMPILER=/opt/rh/devtoolset-3/root/usr/bin/gcc \
 -DCMAKE_INSTALL_PREFIX=/opt/apache-arrow"
...
apache-arrow-3.0.0/cpp/build> make
Scanning dependencies of target jemalloc_ep
[ 1%] Creating directories for 'jemalloc_ep'
[ 1%] Performing download step (download, verify and extract) for
'jemalloc_ep'
-- jemalloc_ep download command succeeded. See also
apache-arrow-3.0.0/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-download-*.log
[ 2%] Performing patch step for 'jemalloc_ep'
[ 2%] No update step for 'jemalloc_ep'
[ 2%] Performing configure step for 'jemalloc_ep'
-- jemalloc_ep configure command succeeded. See also
apache-arrow-3.0.0/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-configure-*.log
[ 3%] Performing build step for 'jemalloc_ep'
-- jemalloc_ep build command succeeded. See also
apache-arrow-3.0.0/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-build-*.log
[ 3%] Performing install step for 'jemalloc_ep'
-- jemalloc_ep install command succeeded. See also
apache-arrow-3.0.0/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-install-*.log
[ 4%] Completed 'jemalloc_ep'
[ 4%] Built target jemalloc_ep
Scanning dependencies of target boost_ep
[ 4%] Creating directories for 'boost_ep'
[ 5%] Performing download step (download, verify and extract) for
'boost_ep'
-- boost_ep download command succeeded. See also
apache-arrow-3.0.0/cpp/build/boost_ep-prefix/src/boost_ep-stamp/boost_ep-download-*.log
[ 5%] No patch step for 'boost_ep'
[ 5%] No update step for 'boost_ep'
[ 6%] No configure step for 'boost_ep'
[ 6%] No build step for 'boost_ep'
[ 7%] No install step for 'boost_ep'
[ 7%] Completed 'boost_ep'
[ 7%] Built target boost_ep
Scanning dependencies of target thrift_ep
[ 7%] Creating directories for 'thrift_ep'
[ 7%] Performing download step (download, verify and extract) for
'thrift_ep'
-- thrift_ep download command succeeded. See also
apache-arrow-3.0.0/cpp/build/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-download-*.log
[ 7%] No patch step for 'thrift_ep'
[ 8%] No update step for 'thrift_ep'
[ 9%] Performing configure step for 'thrift_ep'
-- thrift_ep configure command succeeded. See also
apache-arrow-3.0.0/cpp/build/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-configure-*.log
[ 9%] Performing build step for 'thrift_ep'
-- thrift_ep build command succeeded. See also
apache-arrow-3.0.0/cpp/build/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-build-*.log
[ 10%] Performing install step for 'thrift_ep'
CMake Error at
apache-arrow-3.0.0/cpp/build/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-install-RELEASE.cmake:37
(message):
 Command failed: 2
'make' 'install'
See also

apache-arrow-3.0.0/cpp/build/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-install-*.log

-- stdout output is:
-- stderr output is:
make[3]: *** No rule to make target `install'. Stop.
CMake Error at
apache-arrow-3.0.0/cpp/build/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-install-RELEASE.cmake:47
(message):
 Stopping after outputting logs.

make[2]: *** [thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-install] Error
1
make[1]: *** [CMakeFiles/thrift_ep.dir/all] Error 2
make: *** [all] Error 2

> cat
apache-arrow-3.0.0/cpp/build/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-install-*.log
make[3]: *** No rule to make target `install'. Stop.
{code}
 

 





[jira] [Commented] (ARROW-11153) [C++][Packaging] Move debian/ubuntu/centos packaging off of Travis-CI

2021-09-28 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421837#comment-17421837
 ] 

Kouhei Sutou commented on ARROW-11153:
--

[~kszucs] said that we can still use {{arm64-graviton2}} on Travis CI. 
Voltron Labs can sponsor the fee for Travis CI.

> [C++][Packaging] Move debian/ubuntu/centos packaging off of Travis-CI
> -
>
> Key: ARROW-11153
> URL: https://issues.apache.org/jira/browse/ARROW-11153
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Packaging
>Reporter: Neal Richardson
>Assignee: Kouhei Sutou
>Priority: Blocker
> Fix For: 6.0.0
>
>
> Per mailing list discussion





[jira] [Updated] (ARROW-11153) [C++][Packaging] Move debian/ubuntu/centos packaging off of Travis-CI

2021-09-28 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-11153:
-
Fix Version/s: (was: 6.0.0)

> [C++][Packaging] Move debian/ubuntu/centos packaging off of Travis-CI
> -
>
> Key: ARROW-11153
> URL: https://issues.apache.org/jira/browse/ARROW-11153
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Packaging
>Reporter: Neal Richardson
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> Per mailing list discussion





[jira] [Commented] (ARROW-14076) Unable to use `red-arrow` gem on Heroku/Ubuntu 20.04 (focal)

2021-09-28 Thread Daniel Rice (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421804#comment-17421804
 ] 

Daniel Rice commented on ARROW-14076:
-

 
{code:java}
~ $ ldd /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow.so 
linux-vdso.so.1 (0x7fffbcd6f000)
libruby.so.2.7 => not found
libarrow.so.500 => /app/.apt/usr/lib/x86_64-linux-gnu/libarrow.so.500 
(0x7f8fd7389000)
libarrow-glib.so.500 => 
/app/.apt/usr/lib/x86_64-linux-gnu/libarrow-glib.so.500 (0x7f8fd7243000)
libgobject-2.0.so.0 => 
/app/.apt/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0 (0x7f8fd71e3000)

/tmp/build_29fd2902/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/ext/extpp/libruby-extpp.so
 => not found
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 
(0x7f8fd6fff000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f8fd6e0d000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 
(0x7f8fd6df2000)
libbrotlienc.so.1 => /lib/x86_64-linux-gnu/libbrotlienc.so.1 
(0x7f8fd6d69000)
libbrotlidec.so.1 => /lib/x86_64-linux-gnu/libbrotlidec.so.1 
(0x7f8fd6d5b000)
libutf8proc.so.2 => /app/.apt/usr/lib/x86_64-linux-gnu/libutf8proc.so.2 
(0x7f8fd6d0c000)
libre2.so.5 => /app/.apt/usr/lib/x86_64-linux-gnu/libre2.so.5 
(0x7f8fd6c9b000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f8fd6c95000)
libcrypto.so.1.1 => /lib/x86_64-linux-gnu/libcrypto.so.1.1 
(0x7f8fd69bf000)
libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 
(0x7f8fd69ac000)
liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x7f8fd698b000)
libsnappy.so.1 => /app/.apt/usr/lib/x86_64-linux-gnu/libsnappy.so.1 
(0x7f8fd697e000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f8fd6962000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x7f8fd68b9000)
libcurl.so.4 => /lib/x86_64-linux-gnu/libcurl.so.4 (0x7f8fd6828000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f8fd66d9000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x7f8fd66b6000)
/lib64/ld-linux-x86-64.so.2 (0x7f8fd8a68000)
libglib-2.0.so.0 => /app/.apt/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0 
(0x7f8fd658b000)
libgio-2.0.so.0 => /app/.apt/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0 
(0x7f8fd63aa000)
libffi.so.7 => /lib/x86_64-linux-gnu/libffi.so.7 (0x7f8fd639e000)
libbrotlicommon.so.1 => /lib/x86_64-linux-gnu/libbrotlicommon.so.1 
(0x7f8fd637b000)
libnghttp2.so.14 => /lib/x86_64-linux-gnu/libnghttp2.so.14 
(0x7f8fd6352000)
libidn2.so.0 => /lib/x86_64-linux-gnu/libidn2.so.0 (0x7f8fd632f000)
librtmp.so.1 => /lib/x86_64-linux-gnu/librtmp.so.1 (0x7f8fd630f000)
libssh.so.4 => /lib/x86_64-linux-gnu/libssh.so.4 (0x7f8fd62a1000)
libpsl.so.5 => /lib/x86_64-linux-gnu/libpsl.so.5 (0x7f8fd628e000)
libssl.so.1.1 => /lib/x86_64-linux-gnu/libssl.so.1.1 
(0x7f8fd61fb000)
libgssapi_krb5.so.2 => /lib/x86_64-linux-gnu/libgssapi_krb5.so.2 
(0x7f8fd61ae000)
libldap_r-2.4.so.2 => /lib/x86_64-linux-gnu/libldap_r-2.4.so.2 
(0x7f8fd6156000)
liblber-2.4.so.2 => /lib/x86_64-linux-gnu/liblber-2.4.so.2 
(0x7f8fd6145000)
libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x7f8fd60d2000)
libgmodule-2.0.so.0 => 
/app/.apt/usr/lib/x86_64-linux-gnu/libgmodule-2.0.so.0 (0x7f8fd60cc000)
libmount.so.1 => /lib/x86_64-linux-gnu/libmount.so.1 
(0x7f8fd606c000)
libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 
(0x7f8fd603f000)
libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 
(0x7f8fd6023000)
libunistring.so.2 => /lib/x86_64-linux-gnu/libunistring.so.2 
(0x7f8fd5ea1000)
libgnutls.so.30 => /lib/x86_64-linux-gnu/libgnutls.so.30 
(0x7f8fd5ccb000)
libhogweed.so.5 => /lib/x86_64-linux-gnu/libhogweed.so.5 
(0x7f8fd5c94000)
libnettle.so.7 => /lib/x86_64-linux-gnu/libnettle.so.7 
(0x7f8fd5c5a000)
libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x7f8fd5bd4000)
libkrb5.so.3 => /lib/x86_64-linux-gnu/libkrb5.so.3 (0x7f8fd5af7000)
libk5crypto.so.3 => /lib/x86_64-linux-gnu/libk5crypto.so.3 
(0x7f8fd5ac6000)
libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 
(0x7f8fd5abf000)
libkrb5support.so.0 => /lib/x86_64-linux-gnu/libkrb5support.so.0 
(0x7f8fd5ab)
libsasl2.so.2 => /lib/x86_64-linux-gnu/libsasl2.so.2 
(0x7f8fd5a91000)
libgssapi.so.3 => /lib/x86_64-linux-gnu/libgssapi.so.3 
(0x7f8fd5a4c000)
libblkid.so.1 => /lib/x86_64-linux-gnu/libblkid.so.1 
(0x7f8fd59f5000)
libpcre2-8.so.0 => 

[jira] [Commented] (ARROW-14076) Unable to use `red-arrow` gem on Heroku/Ubuntu 20.04 (focal)

2021-09-28 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421793#comment-17421793
 ] 

Kouhei Sutou commented on ARROW-14076:
--

Could you show the output of {{ldd 
/app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow.so}}?
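
When the output is long, the unresolved entries can be picked out with a 
small helper like the Python sketch below. missing_libraries is a 
hypothetical name, and the sketch assumes each `name => target` entry 
sits on a single line, as in unwrapped ldd output.

```python
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as "not found"."""
    missing: List[str] = []
    for line in ldd_output.splitlines():
        name, sep, target = line.partition("=>")
        if sep and target.strip() == "not found":
            missing.append(name.strip())
    return missing
```

Entries that an email client has wrapped across two lines would need to 
be rejoined before being passed to this function.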

> Unable to use `red-arrow` gem on Heroku/Ubuntu 20.04 (focal)
> 
>
> Key: ARROW-14076
> URL: https://issues.apache.org/jira/browse/ARROW-14076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 4.0.0
> Environment: Ruby 2.7.4 on Ubuntu 20.04/Heroku
>Reporter: Daniel Rice
>Priority: Major
>
>  
> Hello,
>  
> I am not able to get the Ruby gems, `red-arrow` and `red-parquet`, to work 
> properly on Heroku.  Heroku itself is merely an Ubuntu 20.04 LTS (focal) 
> container so this really is a question about what dependencies must be 
> installed to get these gems to work on Focal?
> So far I have specified the following in Heroku's `Aptfile`: 
> {code:java}
> # Get Heroku's Ubuntu distro for your Stack.  Heroku-20 = focal
> # Running bash on ⬢ ... up, run.1471 (Hobby)
> # ~ $ lsb_release --codename --short
> :repo:deb [trusted=yes arch=amd64] 
> https://apache.jfrog.io/artifactory/arrow/ubuntu/ focal main
> libarrow-dev
> libparquet-dev
> libarrow-glib-dev
> libparquet-glib-dev
> libgirepository-1.0-1
> libgirepository1.0-dev
> libglib2.0-dev
> libglib2.0-0
> gir1.2-glib-2.0
> gobject-introspection
> {code}
> Note: the above contains additional packages that were required by 
> `red-arrow` that WERE NOT SPECIFIED in the Installation guide at 
> [https://arrow.apache.org/install/.|https://arrow.apache.org/install/]
> Despite all my efforts, I am unable to solve this issue:
> {code:java}
> 2021-09-21T23:05:11.469561+00:00 heroku[worker.1]: Process exited with status 
> 1
> 2021-09-21T23:05:11.263179+00:00 app[worker.1]: bundler: failed to load 
> command: sidekiq (/app/vendor/bundle/ruby/2.7.0/bin/sidekiq)
> 2021-09-21T23:05:11.263465+00:00 app[worker.1]: 
> /app/vendor/bundle/ruby/2.7.0/gems/zeitwerk-2.4.2/lib/zeitwerk/kernel.rb:34:in
>  `require': 
> /tmp/build_29fd2902/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/ext/extpp/libruby-extpp.so:
>  cannot open shared object file: No such file or directory - 
> /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow.so (LoadError)
> 2021-09-21T23:05:11.263508+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/zeitwerk-2.4.2/lib/zeitwerk/kernel.rb:34:in
>  `require'
> 2021-09-21T23:05:11.263521+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow/loader.rb:112:in 
> `require_extension_library'
> 2021-09-21T23:05:11.263532+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow/loader.rb:31:in 
> `post_load'
> 2021-09-21T23:05:11.263544+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/gobject-introspection-3.4.4/lib/gobject-introspection/loader.rb:45:in
>  `load'
> 2021-09-21T23:05:11.263565+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/gobject-introspection-3.4.4/lib/gobject-introspection/loader.rb:25:in
>  `load'
> {code}
>  What is super frustrating is that the directory, 
> `/app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib`, is specified in 
> `LD_LIBRARY_PATH`, so I'm not sure why it's not being found.
> *+_Any help determining the full list of dependent packages for Ubuntu 20.04 
> (focal) would be greatly appreciated._+*  
>  
> *Extra environment details:*
>  
> Ruby 2.7.4 on Ubuntu 20.04/Heroku
>  
> *Relevant gem versions:*
> red-arrow (4.0.0)
> red-parquet (4.0.0)
> gio2 (3.4.4)
> gobject-introspection (3.4.4)
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-14076) Unable to use `red-arrow` gem on Heroku/Ubuntu 20.04 (focal)

2021-09-28 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421607#comment-17421607
 ] 

Kouhei Sutou edited comment on ARROW-14076 at 9/28/21, 9:17 PM:


I'm still having the same issue.  Here's some more environmental information 
from `heroku run bash`:

 
{noformat}
Running *bash* on ⬢  ... up, run.6036 (Hobby)

*~* *$* echo $LD_LIBRARY_PATH

/app/.apt/usr/lib/x86_64-linux-gnu:/app/.apt/usr/lib/i386-linux-gnu:/app/.apt/usr/lib:/app/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/lib:/app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib

*~* *$* ls -al /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib

total 972

drwx-- 3 u7677 dyno   4096 Sep 21 18:40 .

drwx-- 7 u7677 dyno   4096 Sep 21 18:40 ..

drwx-- 2 u7677 dyno   4096 Sep 21 18:40 arrow

-rw--- 1 u7677 dyno    941 Sep 21 18:40 arrow.rb

-rwx-- 1 u7677 dyno 976336 Sep 21 18:40 arrow.so

*~* *$* ls -al /app/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/lib

total 192

drwx-- 3 u7677 dyno   4096 Sep 21 18:35 .

drwx-- 8 u7677 dyno   4096 Sep 21 18:35 ..

drwx-- 2 u7677 dyno   4096 Sep 21 18:35 extpp

-rw--- 1 u7677 dyno   1049 Sep 21 18:35 extpp.rb

-rwx-- 1 u7677 dyno 177576 Sep 21 18:35 libruby-extpp.so

*~* *$*
{noformat}


was (Author: danielricecodes):
I'm still having the same issue.  Here's some more environmental information 
from `heroku run bash`:

 

Running *bash* on ⬢  ... up, run.6036 (Hobby)

*~* *$* echo $LD_LIBRARY_PATH

/app/.apt/usr/lib/x86_64-linux-gnu:/app/.apt/usr/lib/i386-linux-gnu:/app/.apt/usr/lib:/app/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/lib:/app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib

*~* *$* ls -al /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib

total 972

drwx-- 3 u7677 dyno   4096 Sep 21 18:40 .

drwx-- 7 u7677 dyno   4096 Sep 21 18:40 ..

drwx-- 2 u7677 dyno   4096 Sep 21 18:40 arrow

-rw--- 1 u7677 dyno    941 Sep 21 18:40 arrow.rb

-rwx-- 1 u7677 dyno 976336 Sep 21 18:40 arrow.so

*~* *$* ls -al /app/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/lib

total 192

drwx-- 3 u7677 dyno   4096 Sep 21 18:35 .

drwx-- 8 u7677 dyno   4096 Sep 21 18:35 ..

drwx-- 2 u7677 dyno   4096 Sep 21 18:35 extpp

-rw--- 1 u7677 dyno   1049 Sep 21 18:35 extpp.rb

-rwx-- 1 u7677 dyno 177576 Sep 21 18:35 libruby-extpp.so

*~* *$*

> Unable to use `red-arrow` gem on Heroku/Ubuntu 20.04 (focal)
> 
>
> Key: ARROW-14076
> URL: https://issues.apache.org/jira/browse/ARROW-14076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 4.0.0
> Environment: Ruby 2.7.4 on Ubuntu 20.04/Heroku
>Reporter: Daniel Rice
>Priority: Major
>
>  
> Hello,
>  
> I am not able to get the Ruby gems, `red-arrow` and `red-parquet`, to work 
> properly on Heroku.  Heroku itself is merely an Ubuntu 20.04 LTS (focal) 
> container so this really is a question about what dependencies must be 
> installed to get these gems to work on Focal?
> So far I have specified the following in Heroku's `Aptfile`: 
> {code:java}
> # Get Heroku's Ubuntu distro for your Stack.  Heroku-20 = focal
> # Running bash on ⬢ ... up, run.1471 (Hobby)
> # ~ $ lsb_release --codename --short
> :repo:deb [trusted=yes arch=amd64] https://apache.jfrog.io/artifactory/arrow/ubuntu/ focal main
> libarrow-dev
> libparquet-dev
> libarrow-glib-dev
> libparquet-glib-dev
> libgirepository-1.0-1
> libgirepository1.0-dev
> libglib2.0-dev
> libglib2.0-0
> gir1.2-glib-2.0
> gobject-introspection
> {code}
> Note: the above contains additional packages that were required by 
> `red-arrow` that WERE NOT SPECIFIED in the Installation guide at 
> [https://arrow.apache.org/install/.|https://arrow.apache.org/install/]
> Despite all my efforts, I am unable to solve this issue:
> {code:java}
> 2021-09-21T23:05:11.469561+00:00 heroku[worker.1]: Process exited with status 
> 1
> 2021-09-21T23:05:11.263179+00:00 app[worker.1]: bundler: failed to load 
> command: sidekiq (/app/vendor/bundle/ruby/2.7.0/bin/sidekiq)
> 2021-09-21T23:05:11.263465+00:00 app[worker.1]: 
> /app/vendor/bundle/ruby/2.7.0/gems/zeitwerk-2.4.2/lib/zeitwerk/kernel.rb:34:in
>  `require': 
> /tmp/build_29fd2902/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/ext/extpp/libruby-extpp.so:
>  cannot open shared object file: No such file or directory - 
> /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow.so (LoadError)
> 2021-09-21T23:05:11.263508+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/zeitwerk-2.4.2/lib/zeitwerk/kernel.rb:34:in
>  `require'
> 2021-09-21T23:05:11.263521+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow/loader.rb:112:in 
> `require_extension_library'
> 

[jira] [Commented] (ARROW-13130) [C++][Compute] Add decimal support for arithmetic compute functions

2021-09-28 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421775#comment-17421775
 ] 

David Li commented on ARROW-13130:
--

I think we can handle most of these by casting to float64 first (if we want, we 
could avoid the cast and do the conversion inline instead). abs, round (and 
ceil, floor, etc), mode, negate, sign, and tdigest should be handled directly.
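
The two strategies in this comment can be illustrated with a small sketch. It uses Python's standard `decimal` module as a stand-in for Arrow decimals, so it is not the Arrow C++ implementation; all function names are hypothetical and inputs are assumed to be finite.

```python
import math
from decimal import Decimal


def via_float64(func, x: Decimal) -> float:
    """Lossy path: convert the decimal to a binary float, then reuse a float function."""
    return func(float(x))


def decimal_abs(x: Decimal) -> Decimal:
    """Exact path: clears the sign and keeps the digits and scale untouched."""
    return x.copy_abs()


def decimal_negate(x: Decimal) -> Decimal:
    """Exact path: flips the sign and keeps the digits and scale untouched."""
    return x.copy_negate()


def decimal_sign(x: Decimal) -> int:
    """Exact path: -1, 0, or 1 depending on the sign of a finite decimal."""
    return (x > 0) - (x < 0)


if __name__ == "__main__":
    print(via_float64(math.sin, Decimal("1.50")))  # float result, decimal scale is lost
    print(decimal_negate(Decimal("1.10")))         # result is still a Decimal
```

The exact functions preserve the scale of the input, which is the reason to handle them directly instead of round-tripping through a float.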

> [C++][Compute] Add decimal support for arithmetic compute functions
> ---
>
> Key: ARROW-13130
> URL: https://issues.apache.org/jira/browse/ARROW-13130
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: yibocai#1
>Assignee: David Li
>Priority: Major
>  Labels: kernel
>
> The following arithmetic functions do not support decimal:
>  - abs
>  - abs_checked
>  - acos
>  - acos_checked
>  - asin
>  - asin_checked
>  - atan
>  - ceil
>  - cos
>  - cos_checked
>  - floor
>  - is_finite (?)
>  - is_inf (?)
>  - is_nan (?)
>  - ln
>  - ln_checked
>  - log1p
>  - log1p_checked
>  - log2
>  - log2_checked
>  - logb (float/decimal works int/decimal does not)
>  - logb_checked (float/decimal works int/decimal does not)
>  - mode
>  - negate
>  - negate_checked
>  - power (float/decimal works int/decimal does not)
>  - power_checked (float/decimal works int/decimal does not)
>  - quantile
>  - round (ARROW-13975)
>  - round_to_multiple (ARROW-13975)
>  - sign
>  - sin
>  - sin_checked
>  - stddev
>  - tan
>  - tan_checked
>  - tdigest
>  - trunc
>  - variance
> ? - May not be applicable
> The following arithmetic functions do support decimal inputs:
>  - add
>  - add_checked
>  - atan2
>  - divide
>  - divide_checked
>  - equal (ARROW-13966)
>  - greater (ARROW-13966)
>  - greater_equal (ARROW-13966)
>  - less (ARROW-13966)
>  - less_equal (ARROW-13966)
>  - mean
>  - min_max
>  - multiply
>  - multiply_checked
>  - product
>  - subtract
>  - subtract_checked
>  - sum
>  - unique





[jira] [Comment Edited] (ARROW-13130) [C++][Compute] Add decimal support for arithmetic compute functions

2021-09-28 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421775#comment-17421775
 ] 

David Li edited comment on ARROW-13130 at 9/28/21, 8:36 PM:


I think we can handle most of these by casting to float64 first (if we want, we 
could avoid the cast and do the conversion inline instead). abs, round (and 
ceil, floor, etc), mode, negate, and sign should be handled directly.


was (Author: lidavidm):
I think we can handle most of these by casting to float64 first (if we want, we 
could avoid the cast and do the conversion inline instead). abs, round (and 
ceil, floor, etc), mode, negate, sign, and tdigest should be handled directly.

> [C++][Compute] Add decimal support for arithmetic compute functions
> ---
>
> Key: ARROW-13130
> URL: https://issues.apache.org/jira/browse/ARROW-13130
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: yibocai#1
>Assignee: David Li
>Priority: Major
>  Labels: kernel
>
> The following arithmetic functions do not support decimal:
>  - abs
>  - abs_checked
>  - acos
>  - acos_checked
>  - asin
>  - asin_checked
>  - atan
>  - ceil
>  - cos
>  - cos_checked
>  - floor
>  - is_finite (?)
>  - is_inf (?)
>  - is_nan (?)
>  - ln
>  - ln_checked
>  - log1p
>  - log1p_checked
>  - log2
>  - log2_checked
>  - logb (float/decimal works int/decimal does not)
>  - logb_checked (float/decimal works int/decimal does not)
>  - mode
>  - negate
>  - negate_checked
>  - power (float/decimal works int/decimal does not)
>  - power_checked (float/decimal works int/decimal does not)
>  - quantile
>  - round (ARROW-13975)
>  - round_to_multiple (ARROW-13975)
>  - sign
>  - sin
>  - sin_checked
>  - stddev
>  - tan
>  - tan_checked
>  - tdigest
>  - trunc
>  - variance
> ? - May not be applicable
> The following arithmetic functions do support decimal inputs:
>  - add
>  - add_checked
>  - atan2
>  - divide
>  - divide_checked
>  - equal (ARROW-13966)
>  - greater (ARROW-13966)
>  - greater_equal (ARROW-13966)
>  - less (ARROW-13966)
>  - less_equal (ARROW-13966)
>  - mean
>  - min_max
>  - multiply
>  - multiply_checked
>  - product
>  - subtract
>  - subtract_checked
>  - sum
>  - unique





[jira] [Comment Edited] (ARROW-13130) [C++][Compute] Add decimal support for arithmetic compute functions

2021-09-28 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421775#comment-17421775
 ] 

David Li edited comment on ARROW-13130 at 9/28/21, 8:36 PM:


I think we can handle most of these by casting to float64 first (if we want, we 
could avoid the cast and do the conversion inline instead). abs, round (and 
ceil, floor, etc), mode, negate, quantile, and sign should be handled directly.


was (Author: lidavidm):
I think we can handle most of these by casting to float64 first (if we want, we 
could avoid the cast and do the conversion inline instead). abs, round (and 
ceil, floor, etc), mode, negate, and sign should be handled directly.

> [C++][Compute] Add decimal support for arithmetic compute functions
> ---
>
> Key: ARROW-13130
> URL: https://issues.apache.org/jira/browse/ARROW-13130
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: yibocai#1
>Assignee: David Li
>Priority: Major
>  Labels: kernel
>
> The following arithmetic functions do not support decimal:
>  - abs
>  - abs_checked
>  - acos
>  - acos_checked
>  - asin
>  - asin_checked
>  - atan
>  - ceil
>  - cos
>  - cos_checked
>  - floor
>  - is_finite (?)
>  - is_inf (?)
>  - is_nan (?)
>  - ln
>  - ln_checked
>  - log1p
>  - log1p_checked
>  - log2
>  - log2_checked
>  - logb (float/decimal works int/decimal does not)
>  - logb_checked (float/decimal works int/decimal does not)
>  - mode
>  - negate
>  - negate_checked
>  - power (float/decimal works int/decimal does not)
>  - power_checked (float/decimal works int/decimal does not)
>  - quantile
>  - round (ARROW-13975)
>  - round_to_multiple (ARROW-13975)
>  - sign
>  - sin
>  - sin_checked
>  - stddev
>  - tan
>  - tan_checked
>  - tdigest
>  - trunc
>  - variance
> ? - May not be applicable
> The following arithmetic functions do support decimal inputs:
>  - add
>  - add_checked
>  - atan2
>  - divide
>  - divide_checked
>  - equal (ARROW-13966)
>  - greater (ARROW-13966)
>  - greater_equal (ARROW-13966)
>  - less (ARROW-13966)
>  - less_equal (ARROW-13966)
>  - mean
>  - min_max
>  - multiply
>  - multiply_checked
>  - product
>  - subtract
>  - subtract_checked
>  - sum
>  - unique





[jira] [Assigned] (ARROW-13130) [C++][Compute] Add decimal support for arithmetic compute functions

2021-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13130:


Assignee: David Li

> [C++][Compute] Add decimal support for arithmetic compute functions
> ---
>
> Key: ARROW-13130
> URL: https://issues.apache.org/jira/browse/ARROW-13130
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: yibocai#1
>Assignee: David Li
>Priority: Major
>  Labels: kernel
>
> The following arithmetic functions do not support decimal:
>  - abs
>  - abs_checked
>  - acos
>  - acos_checked
>  - asin
>  - asin_checked
>  - atan
>  - ceil
>  - cos
>  - cos_checked
>  - floor
>  - is_finite (?)
>  - is_inf (?)
>  - is_nan (?)
>  - ln
>  - ln_checked
>  - log1p
>  - log1p_checked
>  - log2
>  - log2_checked
>  - logb (float/decimal works int/decimal does not)
>  - logb_checked (float/decimal works int/decimal does not)
>  - mode
>  - negate
>  - negate_checked
>  - power (float/decimal works int/decimal does not)
>  - power_checked (float/decimal works int/decimal does not)
>  - quantile
>  - round (ARROW-13975)
>  - round_to_multiple (ARROW-13975)
>  - sign
>  - sin
>  - sin_checked
>  - stddev
>  - tan
>  - tan_checked
>  - tdigest
>  - trunc
>  - variance
> ? - May not be applicable
> The following arithmetic functions do support decimal inputs:
>  - add
>  - add_checked
>  - atan2
>  - divide
>  - divide_checked
>  - equal (ARROW-13966)
>  - greater (ARROW-13966)
>  - greater_equal (ARROW-13966)
>  - less (ARROW-13966)
>  - less_equal (ARROW-13966)
>  - mean
>  - min_max
>  - multiply
>  - multiply_checked
>  - product
>  - subtract
>  - subtract_checked
>  - sum
>  - unique





[jira] [Updated] (ARROW-13130) [C++][Compute] Add decimal support for arithmetic compute functions

2021-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13130:
-
Description: 
The following arithmetic functions do not support decimal:
 - abs
 - abs_checked
 - acos
 - acos_checked
 - asin
 - asin_checked
 - atan
 - ceil
 - cos
 - cos_checked
 - floor
 - is_finite (?)
 - is_inf (?)
 - is_nan (?)
 - ln
 - ln_checked
 - log1p
 - log1p_checked
 - log2
 - log2_checked
 - logb (float/decimal works int/decimal does not)
 - logb_checked (float/decimal works int/decimal does not)
 - mode
 - negate
 - negate_checked
 - power (float/decimal works int/decimal does not)
 - power_checked (float/decimal works int/decimal does not)
 - quantile
 - round (ARROW-13975)
 - round_to_multiple (ARROW-13975)
 - sign
 - sin
 - sin_checked
 - stddev
 - tan
 - tan_checked
 - tdigest
 - trunc
 - variance

? - May not be applicable

The following arithmetic functions do support decimal inputs:
 - add
 - add_checked
 - atan2
 - divide
 - divide_checked
 - equal (ARROW-13966)
 - greater (ARROW-13966)
 - greater_equal (ARROW-13966)
 - less (ARROW-13966)
 - less_equal (ARROW-13966)
 - mean
 - min_max
 - multiply
 - multiply_checked
 - product
 - subtract
 - subtract_checked
 - sum
 - unique

  was:
The following arithmetic functions do not support decimal:
 - abs
 - abs_checked
 - acos
 - acos_checked
 - asin
 - asin_checked
 - atan
 - ceil
 - cos
 - cos_checked
 - floor
 - is_finite (?)
 - is_inf (?)
 - is_nan (?)
 - ln
 - ln_checked
 - log1p
 - log1p_checked
 - log2
 - log2_checked
 - logb (float/decimal works int/decimal does not)
 - logb_checked (float/decimal works int/decimal does not)
 - mode
 - negate
 - negate_checked
 - power (float/decimal works int/decimal does not)
 - power_checked (float/decimal works int/decimal does not)
 - quantile
 - sign
 - sin
 - sin_checked
 - stddev
 - tan
 - tan_checked
 - tdigest
 - trunc
 - variance

? - May not be applicable

The following arithmetic functions do support decimal inputs:
 - add
 - add_checked
 - atan2
 - divide
 - divide_checked
 - equal (ARROW-13966)
 - greater (ARROW-13966)
 - greater_equal (ARROW-13966)
 - less (ARROW-13966)
 - less_equal (ARROW-13966)
 - mean
 - min_max
 - multiply
 - multiply_checked
 - product
 - subtract
 - subtract_checked
 - sum
 - unique


> [C++][Compute] Add decimal support for arithmetic compute functions
> ---
>
> Key: ARROW-13130
> URL: https://issues.apache.org/jira/browse/ARROW-13130
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: yibocai#1
>Priority: Major
>  Labels: kernel
>
> The following arithmetic functions do not support decimal:
>  - abs
>  - abs_checked
>  - acos
>  - acos_checked
>  - asin
>  - asin_checked
>  - atan
>  - ceil
>  - cos
>  - cos_checked
>  - floor
>  - is_finite (?)
>  - is_inf (?)
>  - is_nan (?)
>  - ln
>  - ln_checked
>  - log1p
>  - log1p_checked
>  - log2
>  - log2_checked
>  - logb (float/decimal works int/decimal does not)
>  - logb_checked (float/decimal works int/decimal does not)
>  - mode
>  - negate
>  - negate_checked
>  - power (float/decimal works int/decimal does not)
>  - power_checked (float/decimal works int/decimal does not)
>  - quantile
>  - round (ARROW-13975)
>  - round_to_multiple (ARROW-13975)
>  - sign
>  - sin
>  - sin_checked
>  - stddev
>  - tan
>  - tan_checked
>  - tdigest
>  - trunc
>  - variance
> ? - May not be applicable
> The following arithmetic functions do support decimal inputs:
>  - add
>  - add_checked
>  - atan2
>  - divide
>  - divide_checked
>  - equal (ARROW-13966)
>  - greater (ARROW-13966)
>  - greater_equal (ARROW-13966)
>  - less (ARROW-13966)
>  - less_equal (ARROW-13966)
>  - mean
>  - min_max
>  - multiply
>  - multiply_checked
>  - product
>  - subtract
>  - subtract_checked
>  - sum
>  - unique





[jira] [Updated] (ARROW-13948) [C++] index_in/is_in kernels missing support for timestamp with timezone

2021-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13948:
-
Labels: good-first-issue kernel query-engine  (was: kernel query-engine)

> [C++] index_in/is_in kernels missing support for timestamp with timezone
> 
>
> Key: ARROW-13948
> URL: https://issues.apache.org/jira/browse/ARROW-13948
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: good-first-issue, kernel, query-engine
>
> The index_in and is_in kernels should support all equatable value types.  At 
> the moment they support all except timestamp types that have a timezone.





[jira] [Closed] (ARROW-14112) [C++] index_in / is_in missing support for timestamps with a time zone

2021-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li closed ARROW-14112.

Resolution: Duplicate

> [C++] index_in / is_in missing support for timestamps with a time zone
> --
>
> Key: ARROW-14112
> URL: https://issues.apache.org/jira/browse/ARROW-14112
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: good-second-issue, kernel, query-engine
>
> The index_in / is_in functions should only require the value type be 
> equatable.  Timestamps with a time zone are equatable and therefore we should 
> support these types.
> Since we do support timestamps without a time zone I suspect this was an 
> oversight and not a real limitation.
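
A rough illustration of why time-zone-aware timestamps are equatable, using Python's standard `datetime` instead of Arrow types. The functions `is_in` and `index_in` below are hypothetical pure-Python stand-ins that borrow the kernel names; returning the first matching position is a choice made for this sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Sequence


def is_in(values: Sequence[datetime], value_set: Sequence[datetime]) -> List[bool]:
    """True where the value equals some member of value_set."""
    lookup = set(value_set)
    return [v in lookup for v in values]


def index_in(values: Sequence[datetime], value_set: Sequence[datetime]) -> List[Optional[int]]:
    """Position of the first equal member of value_set, or None if absent."""
    positions = {}
    for i, member in enumerate(value_set):
        positions.setdefault(member, i)  # keep the first occurrence
    return [positions.get(v) for v in values]


# The same instant expressed in two different zones.
noon_utc = datetime(2021, 9, 28, 12, 0, tzinfo=timezone.utc)
two_pm_plus2 = datetime(2021, 9, 28, 14, 0, tzinfo=timezone(timedelta(hours=2)))
later = noon_utc + timedelta(hours=1)

if __name__ == "__main__":
    print(is_in([two_pm_plus2, later], [noon_utc]))
    print(index_in([two_pm_plus2, later], [noon_utc]))
```

Aware datetimes compare and hash by the instant they denote, so membership does not depend on which zone a value is displayed in.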





[jira] [Updated] (ARROW-13950) [C++] min_element_wise/max_element_wise missing support for some types

2021-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13950:
-
Labels: good-second-issue kernel query-engine  (was: kernel query-engine)

> [C++] min_element_wise/max_element_wise missing support for some types
> --
>
> Key: ARROW-13950
> URL: https://issues.apache.org/jira/browse/ARROW-13950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: good-second-issue, kernel, query-engine
>
> The min/max element wise kernels should support all sortable types.  
> Currently support is missing for:
>  - decimal
>  - null
>  - binary
>  - large_binary
>  - fixed_size_binary
>  - string
>  - large_string





[jira] [Closed] (ARROW-14113) [C++] max_element_wise / min_element_wise does not support binary

2021-09-28 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li closed ARROW-14113.

Resolution: Duplicate

> [C++] max_element_wise / min_element_wise does not support binary
> -
>
> Key: ARROW-14113
> URL: https://issues.apache.org/jira/browse/ARROW-14113
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: good-second-issue, kernel, query-engine
>
> In general, in other kernels, we consider binary types to be comparable (e.g. 
> we can do sort or top_k on binary types).  Given this, min_element_wise and 
> max_element_wise should support binary types (i.e. binary, large_binary, 
> fixed_size_binary, string, large_string)
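
As a sketch of what comparable binary values imply here, the following pure-Python functions take element-wise minima and maxima over `bytes`, which Python orders lexicographically by byte value. They are stand-ins that borrow the kernel names, not the Arrow kernels, and null handling is left out on purpose.

```python
from typing import List, Sequence


def _check_lengths(arrays: Sequence[Sequence[bytes]]) -> None:
    """Reject inputs whose arrays differ in length."""
    if len({len(a) for a in arrays}) > 1:
        raise ValueError("all input arrays must have the same length")


def min_element_wise(*arrays: Sequence[bytes]) -> List[bytes]:
    """Smallest value at each position, compared byte by byte."""
    _check_lengths(arrays)
    return [min(items) for items in zip(*arrays)]


def max_element_wise(*arrays: Sequence[bytes]) -> List[bytes]:
    """Largest value at each position, compared byte by byte."""
    _check_lengths(arrays)
    return [max(items) for items in zip(*arrays)]


if __name__ == "__main__":
    left = [b"b", b"apple"]
    right = [b"a", b"apricot"]
    print(min_element_wise(left, right))
    print(max_element_wise(left, right))
```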





[jira] [Commented] (ARROW-8379) [R] Investigate/fix thread safety issues (esp. Windows)

2021-09-28 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421666#comment-17421666
 ] 

Neal Richardson commented on ARROW-8379:


I created ARROW-14159 to explore more multithreading controls after 6.0.0.

> [R] Investigate/fix thread safety issues (esp. Windows)
> ---
>
> Key: ARROW-8379
> URL: https://issues.apache.org/jira/browse/ARROW-8379
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 6.0.0
>
>
> There have been a number of issues where the R bindings' multithreading has 
> been implicated in unstable behavior (ARROW-7844 for example). In ARROW-8375 
> I disabled {{use_threads}} in the Windows tests, and it appeared that the 
> mysterious Windows segfaults stopped. We should fix whatever the underlying 
> issues are.





[jira] [Created] (ARROW-14159) [R] Re-allow some multithreading on Windows

2021-09-28 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-14159:
---

 Summary: [R] Re-allow some multithreading on Windows
 Key: ARROW-14159
 URL: https://issues.apache.org/jira/browse/ARROW-14159
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 7.0.0


Followup to ARROW-8379, which set use_threads = FALSE on Windows. See 
discussion about adding more controls, disabling threading in some places and 
not others, etc. We want to do this soon after release so that we have a few 
months to see how things behave on CI before releasing again.





[jira] [Assigned] (ARROW-8379) [R] Investigate/fix thread safety issues (esp. Windows)

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-8379:
--

Assignee: Neal Richardson

> [R] Investigate/fix thread safety issues (esp. Windows)
> ---
>
> Key: ARROW-8379
> URL: https://issues.apache.org/jira/browse/ARROW-8379
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>
> There have been a number of issues where the R bindings' multithreading has 
> been implicated in unstable behavior (ARROW-7844 for example). In ARROW-8375 
> I disabled {{use_threads}} in the Windows tests, and it appeared that the 
> mysterious Windows segfaults stopped. We should fix whatever the underlying 
> issues are.





[jira] [Updated] (ARROW-8379) [R] Investigate/fix thread safety issues (esp. Windows)

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8379:
---
Fix Version/s: 6.0.0

> [R] Investigate/fix thread safety issues (esp. Windows)
> ---
>
> Key: ARROW-8379
> URL: https://issues.apache.org/jira/browse/ARROW-8379
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 6.0.0
>
>
> There have been a number of issues where the R bindings' multithreading has 
> been implicated in unstable behavior (ARROW-7844 for example). In ARROW-8375 
> I disabled {{use_threads}} in the Windows tests, and it appeared that the 
> mysterious Windows segfaults stopped. We should fix whatever the underlying 
> issues are.





[jira] [Updated] (ARROW-13825) [C++][Compute] Add string translate kernel

2021-09-28 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-13825:
--
Fix Version/s: 7.0.0

> [C++][Compute] Add string translate kernel
> --
>
> Key: ARROW-13825
> URL: https://issues.apache.org/jira/browse/ARROW-13825
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Eduardo Ponce
>Assignee: Eduardo Ponce
>Priority: Major
> Fix For: 7.0.0
>
>
> Create compute function for string translate, similar to [Python's 
> str.translate|https://docs.python.org/3/library/stdtypes.html#str.translate], 
> [SQL TRANSLATE|https://www.w3schools.com/sqL/func_sqlserver_translate.asp], 
> and 
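
For reference, the Python behaviour the description points to can be sketched as below. `translate_strings` is a hypothetical helper that applies `str.translate` to a list of optional strings; it makes no claim about the signature the Arrow kernel would have.

```python
from typing import List, Optional, Sequence


def translate_strings(
    values: Sequence[Optional[str]], src: str, dst: str
) -> List[Optional[str]]:
    """Map each character in src to the character at the same position in dst."""
    table = str.maketrans(src, dst)  # src and dst must have equal length
    return [None if v is None else v.translate(table) for v in values]


if __name__ == "__main__":
    print(translate_strings(["cab", None, "banana"], "abc", "xyz"))
```

Characters that do not appear in `src` pass through unchanged, and `None` entries are kept as they are.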





[jira] [Commented] (ARROW-8379) [R] Investigate/fix thread safety issues (esp. Windows)

2021-09-28 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421635#comment-17421635
 ] 

Neal Richardson commented on ARROW-8379:


> Dataset scanning, etc. has been running in a multi threaded fashion on CI 
> rather reliably on RTools >= 4.

Actually it hasn't been--we set use_threads = FALSE on Windows in the test 
setup. I'm essentially proposing to move this to the package .onLoad hook, 
which means that users will experience the (relative) stability we see on CI.

How about we do that for 6.0, and then after 6.0 we explore more controls for 
multithreading. That way, we'll have time to see how it behaves on CI for a 
couple of months before we release next.

> [R] Investigate/fix thread safety issues (esp. Windows)
> ---
>
> Key: ARROW-8379
> URL: https://issues.apache.org/jira/browse/ARROW-8379
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>
> There have been a number of issues where the R bindings' multithreading has 
> been implicated in unstable behavior (ARROW-7844 for example). In ARROW-8375 
> I disabled {{use_threads}} in the Windows tests, and it appeared that the 
> mysterious Windows segfaults stopped. We should fix whatever the underlying 
> issues are.





[jira] [Commented] (ARROW-14035) [C++][Compute] Implement non-hash count_distinct aggregate kernel

2021-09-28 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421634#comment-17421634
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-14035:


Related https://issues.apache.org/jira/browse/ARROW-14158

> [C++][Compute] Implement non-hash count_distinct aggregate kernel
> -
>
> Key: ARROW-14035
> URL: https://issues.apache.org/jira/browse/ARROW-14035
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Critical
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> ARROW-12728 added a {{hash_count_distinct}} hash aggregate kernel, but there 
> is no non-hash {{count_distinct}} aggregate kernel.





[jira] [Created] (ARROW-14158) [C++][Compute] Implement count distinct kernel using HyperLogLog

2021-09-28 Thread Jira
Percy Camilo Triveño Aucahuasi created ARROW-14158:
--

 Summary: [C++][Compute] Implement count distinct kernel using 
HyperLogLog
 Key: ARROW-14158
 URL: https://issues.apache.org/jira/browse/ARROW-14158
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 7.0.0
Reporter: Percy Camilo Triveño Aucahuasi


Having a version of the aggregation kernel count distinct using HyperLogLog may 
be useful.

Note: The implementation should support the merge operator.

cc [~icook] [~lidavidm]

Some resources/links:
[http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf]
[https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/]
[https://github.com/facebookincubator/velox/tree/main/velox/aggregates/hyperloglog]
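
To make the merge requirement concrete, here is a minimal HyperLogLog sketch in pure Python. It is an illustration under stated assumptions (SHA-1 truncated to 64 bits as the hash, 2^12 registers, linear counting for small cardinalities), not a proposal for the Arrow kernel, and the class and method names are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter with 2**p registers."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: bytes) -> None:
        # 64-bit hash: the top p bits select a register, the rest feed the rank.
        h = int.from_bytes(hashlib.sha1(value).digest()[:8], "big")
        width = 64 - self.p
        index = h >> width
        rest = h & ((1 << width) - 1)
        rank = width - rest.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Merging is an element-wise maximum, so partial sketches combine losslessly.
        if other.p != self.p:
            raise ValueError("cannot merge sketches with different precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


if __name__ == "__main__":
    left, right = HyperLogLog(), HyperLogLog()
    for i in range(6000):
        left.add(str(i).encode())
    for i in range(4000, 10000):
        right.add(str(i).encode())
    left.merge(right)
    print(round(left.count()))
```

Because `merge` is an element-wise maximum over registers, merging two partial sketches gives the same registers as feeding every value into a single sketch.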





[jira] [Commented] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread Rares Vernica (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421632#comment-17421632
 ] 

Rares Vernica commented on ARROW-14152:
---

Two comments:
 # On my CentOS 7 setup I first install cmake3 and devtoolset-3-toolchain. This 
gives me gcc 4.9.2
 # My original issue was this URL 
https://github.com/boostorg/boost/archive/boost-${ARROW_BOOST_BUILD_VERSION}.tar.gz
 used in ThirdpartyToolchain.cmake 
[https://github.com/apache/arrow/blob/ef4e92982054fcc723729ab968296d799d3108dd/cpp/cmake_modules/ThirdpartyToolchain.cmake#L405]
 This URL is wrong and does not provide the Boost headers as expected.

So, the original issue I raised on the mailing list 
[https://lists.apache.org/thread.html/rde68810f24ec81280860cfc9069b29e9df23b934164a34a0a4898ed2%40%3Cdev.arrow.apache.org%3E]
 can be easily fixed by removing or updating the URL in 
ThirdpartyToolchain.cmake line 405
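
One way to check whether a given Boost tarball is usable is to look for a well-known header inside it. The sketch below is a hypothetical helper, not part of Arrow's build. The function names and the choice of `boost/version.hpp` as the marker file are assumptions. A plausible explanation for the missing headers, which the comment above does not confirm, is that GitHub's generated archives leave out git submodule contents.

```python
import io
import tarfile


def has_boost_headers(tar_bytes):
    """Return True if the archive contains a boost/version.hpp header."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        return any(name.endswith("boost/version.hpp") for name in tar.getnames())


def make_tarball(paths):
    """Build a small in-memory .tar.gz containing empty files at `paths`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in paths:
            tar.addfile(tarfile.TarInfo(path))
    return buf.getvalue()
```

In practice `has_boost_headers` would be given the bytes of a downloaded archive; `make_tarball` exists only so the check can be tried without network access.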

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
> Attachments: CentOS7-gcc4.8.5-CMakeCache.txt, 
> CentOS7-gcc4.8.5-makelog.out
>
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.





[jira] [Commented] (ARROW-13853) [R] String to_title, to_lower, to_upper kernels

2021-09-28 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421631#comment-17421631
 ] 

Neal Richardson commented on ARROW-13853:
-

I made a suggestion on the PR

> [R] String to_title, to_lower, to_upper kernels
> ---
>
> Key: ARROW-13853
> URL: https://issues.apache.org/jira/browse/ARROW-13853
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ARROW-12714 added the *str_to_title* kernel and a basic mapping, but we 
> should add a test. Also the stringr function takes a "locale" argument which 
> is not handled here; we should either pass it to Arrow C++ if it supports it 
> (which I doubt) or error if a value is provided in R.
> This also applies to *str_to_lower* and *str_to_upper* kernels.





[jira] [Assigned] (ARROW-14036) [R] Binding for n_distinct() with no grouping

2021-09-28 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook reassigned ARROW-14036:


Assignee: Ian Cook

> [R] Binding for n_distinct() with no grouping
> -
>
> Key: ARROW-14036
> URL: https://issues.apache.org/jira/browse/ARROW-14036
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Critical
>  Labels: kernel
> Fix For: 6.0.0
>
>
> ARROW-13620 added a binding for {{n_distinct()}} but it only works for 
> _grouped_ aggregation, not whole-table aggregation. 
> This works:
> {code:java}
> Table$create(starwars) %>%
>   group_by(homeworld) %>%
>   summarise(n_distinct(species)) %>%
>   collect(){code}
> but this errors:
> {code:java}
> Table$create(starwars) %>%
>   summarise(n_distinct(species)) %>%
>   collect()
> #> Error: Key error: No function registered with name: count_distinct{code}
> Once we have a non-hash {{count_distinct}} aggregate kernel in the C++ 
> library (ARROW-14035) we should bind the options for it in the R package and 
> add a test.
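
The difference between the grouped and the whole-table form can be sketched in plain Python. This is not the Arrow kernel. The function names only mirror the kernel names mentioned in the issue, and skipping nulls (`None` here) is an assumption about the default behavior.

```python
from collections import defaultdict


def count_distinct(values):
    """Whole-column distinct count (nulls, here None, are skipped)."""
    return len({v for v in values if v is not None})


def hash_count_distinct(keys, values):
    """Grouped distinct count: one result per key."""
    groups = defaultdict(set)
    for key, value in zip(keys, values):
        groups[key]  # make sure the group exists even if the value is null
        if value is not None:
            groups[key].add(value)
    return {key: len(seen) for key, seen in groups.items()}
```

The grouped form needs a key column, which is why the first query above works and the second one has no kernel to dispatch to.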





[jira] [Updated] (ARROW-13975) [C++][Compute] Add decimal support to round functions

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13975:

Component/s: C++

> [C++][Compute] Add decimal support to round functions
> -
>
> Key: ARROW-13975
> URL: https://issues.apache.org/jira/browse/ARROW-13975
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Eduardo Ponce
>Priority: Critical
>  Labels: kernel
> Fix For: 6.0.0
>
>
> Need to add Decimal support to the rounding compute functions. The PR for 
> adding round compute functions (ARROW-12744) only includes basic arithmetic 
> types (unsigned/signed int and floating-point).
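
As a reference for the expected semantics, decimal rounding can be sketched with Python's standard `decimal` module. This is not Arrow's implementation. The function name, the `ndigits` parameter, and the default rounding mode are assumptions for illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_decimal(value, ndigits=0, mode=ROUND_HALF_EVEN):
    """Round a Decimal to `ndigits` fractional digits without going through float."""
    quantum = Decimal(1).scaleb(-ndigits)  # ndigits=2 -> exponent -2, i.e. 0.01
    return value.quantize(quantum, rounding=mode)
```

Staying in decimal arithmetic matters because a value such as 2.665 has no exact binary floating-point representation, so rounding it as a float can give a different answer than rounding the decimal itself. A negative `ndigits` rounds to tens, hundreds, and so on.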





[jira] [Updated] (ARROW-14126) [C++] Add locale support for relevant string compute functions

2021-09-28 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-14126:
--
Description: 
String compute functions do not make use of locale settings for case-changing 
transformations, string comparisons, and number-to-string casting. Arrow does 
provide UTF-8 string kernels to handle localization standardization. It would 
be good to add locale support for string kernels that are affected by it.

The following are string functions that take a `locale` option as their second 
argument:
* str_to_lower
* str_to_upper
* str_to_title

  was:String compute functions do not make use of locale settings for case 
changing transformations, string comparisons, and number to string casting. 
Arrow does provides UTF-8 string kernels to handle localization 
standardization. It would be good to add locale support for string kernels that 
are affected by it.


> [C++] Add locale support for relevant string compute functions
> --
>
> Key: ARROW-14126
> URL: https://issues.apache.org/jira/browse/ARROW-14126
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Eduardo Ponce
>Priority: Major
>  Labels: kernel
> Fix For: 7.0.0
>
>
> String compute functions do not make use of locale settings for case-changing 
> transformations, string comparisons, and number-to-string casting. Arrow does 
> provide UTF-8 string kernels to handle localization standardization. It 
> would be good to add locale support for string kernels that are affected by 
> it.
> The following are string functions that take a `locale` option as their second 
> argument:
> * str_to_lower
> * str_to_upper
> * str_to_title
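
A small example of why the locale matters for case mapping: in Turkish and Azerbaijani, lowercase "i" upper-cases to dotted "İ" (U+0130), not "I". The helper below is a hypothetical sketch that special-cases only this rule. It is not an Arrow or stringr API, and a real implementation would rely on a full Unicode library.

```python
def to_upper(text, locale=None):
    """Upper-case `text`; Turkish and Azerbaijani map i -> İ (U+0130)."""
    if locale in ("tr", "az"):
        text = text.replace("i", "\u0130")  # dotted capital I
    return text.upper()
```

Without a locale the result is the locale-independent mapping, which is what the current kernels produce.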





[jira] [Commented] (ARROW-8379) [R] Investigate/fix thread safety issues (esp. Windows)

2021-09-28 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421615#comment-17421615
 ] 

Weston Pace commented on ARROW-8379:


> On 32-bit Rtools 3.5, async multithreading just does not work. This means we 
> should disable the dataset features entirely on this build.

+1.  When I dug into this in detail, I believe I found that mingw on 32-bit 
Windows uses a rather custom backport / bespoke implementation of pthreads, 
and there were other issues where it does not properly implement the 
std::thread expectations.  I'd really rather not dig into this.

> Multithreaded conversion from Arrow to R is prone to issues on Windows across 
> the board, regardless of rtools version or 32/64bits. We should set 
> option.use_threads = FALSE on Windows in .onLoad.
> This will have the side effect of disabling multithreading in some other 
> parts of C++ code where use_threads is an option. It is not clear that that 
> is strictly required, but it will be a side effect unless we distinguish the 
> use_threads controls we expose.

+1 on additional controls.  I don't think we should abandon threading on 
Windows entirely.  Dataset scanning, etc. has been running in a multi threaded 
fashion on CI rather reliably on RTools >= 4.

> There may be more work to be done with CPU and IO threadpools, which get used 
> internally in Arrow C++, but I think it might be best to release with these 
> fixes and see if we still get error reports.

+1 to fix the issues as they come.  Again, I don't want to abandon threading on 
Windows entirely.

> Relatedly, an alternative to setting use_threads = FALSE globally would be to 
> leave multithreading on but reduce the size of the CPU and IO threadpools; 
> some reports suggest that setting them to less than the number of CPUs allows 
> them to work. It's not clear though whether this fixes the problem or just 
> decreases the frequency of deadlock.

We currently set the I/O thread pool to 8 and the CPU thread pool to "# of 
cores".  There are API methods to set the size of each pool.  Threads are 
created lazily so as long as you call these methods early enough you won't have 
too many threads.  My hunch is that we are in the "decreases the 
frequency of deadlock" camp, though.
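
The "threads are created lazily" point can be illustrated with Python's standard-library thread pool, used here only as a stand-in for Arrow's pools. The pool size of 8 matches the I/O pool size mentioned above, while the `io` name prefix and the helper function are assumptions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def worker_threads(prefix):
    """Count live threads whose name starts with `prefix`."""
    return sum(t.name.startswith(prefix) for t in threading.enumerate())


pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="io")
before = worker_threads("io")          # no task submitted yet
pool.submit(sum, [1, 2, 3]).result()   # first task spawns a worker
after = worker_threads("io")
pool.shutdown()
```

The capacity is only an upper bound, so lowering it before any work is submitted costs nothing and no surplus threads are ever started.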

> [R] Investigate/fix thread safety issues (esp. Windows)
> ---
>
> Key: ARROW-8379
> URL: https://issues.apache.org/jira/browse/ARROW-8379
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>
> There have been a number of issues where the R bindings' multithreading has 
> been implicated in unstable behavior (ARROW-7844 for example). In ARROW-8375 
> I disabled {{use_threads}} in the Windows tests, and it appeared that the 
> mysterious Windows segfaults stopped. We should fix whatever the underlying 
> issues are.





[jira] [Commented] (ARROW-14076) Unable to use `red-arrow` gem on Heroku/Ubuntu 20.04 (focal)

2021-09-28 Thread Daniel Rice (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421607#comment-17421607
 ] 

Daniel Rice commented on ARROW-14076:
-

I'm still having the same issue.  Here's some more environment information 
from `heroku run bash`:

Running bash on ⬢  ... up, run.6036 (Hobby)
~ $ echo $LD_LIBRARY_PATH
/app/.apt/usr/lib/x86_64-linux-gnu:/app/.apt/usr/lib/i386-linux-gnu:/app/.apt/usr/lib:/app/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/lib:/app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib
~ $ ls -al /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib
total 972
drwx-- 3 u7677 dyno   4096 Sep 21 18:40 .
drwx-- 7 u7677 dyno   4096 Sep 21 18:40 ..
drwx-- 2 u7677 dyno   4096 Sep 21 18:40 arrow
-rw--- 1 u7677 dyno    941 Sep 21 18:40 arrow.rb
-rwx-- 1 u7677 dyno 976336 Sep 21 18:40 arrow.so
~ $ ls -al /app/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/lib
total 192
drwx-- 3 u7677 dyno   4096 Sep 21 18:35 .
drwx-- 8 u7677 dyno   4096 Sep 21 18:35 ..
drwx-- 2 u7677 dyno   4096 Sep 21 18:35 extpp
-rw--- 1 u7677 dyno   1049 Sep 21 18:35 extpp.rb
-rwx-- 1 u7677 dyno 177576 Sep 21 18:35 libruby-extpp.so
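
The LoadError in the issue description names two paths, and it can help to separate them. The sketch below is a hypothetical helper, not part of red-arrow, that splits such a message into the file reported missing and the library being loaded. One reading, which is an assumption and not confirmed in this thread, is that `arrow.so` refers to `libruby-extpp.so` by its build-time path under `/tmp/build_...`, which no longer exists at run time.

```python
import re

LOAD_ERROR = re.compile(
    r"(?P<missing>\S+): cannot open shared object file: "
    r"No such file or directory - (?P<loading>\S+)"
)


def split_load_error(message):
    """Return (missing_library, library_being_loaded) or None if no match."""
    match = LOAD_ERROR.search(message)
    if match is None:
        return None
    return match.group("missing"), match.group("loading")
```

Applied to the log line from the issue, the missing file is the `/tmp/build_29fd2902/...` copy of `libruby-extpp.so`, not `arrow.so` itself, which would explain why adding the `red-arrow` lib directory to `LD_LIBRARY_PATH` does not change the outcome.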

> Unable to use `red-arrow` gem on Heroku/Ubuntu 20.04 (focal)
> 
>
> Key: ARROW-14076
> URL: https://issues.apache.org/jira/browse/ARROW-14076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 4.0.0
> Environment: Ruby 2.7.4 on Ubuntu 20.04/Heroku
>Reporter: Daniel Rice
>Priority: Major
>
>  
> Hello,
>  
> I am not able to get the Ruby gems, `red-arrow` and `red-parquet`, to work 
> properly on Heroku.  Heroku itself is merely an Ubuntu 20.04 LTS (focal) 
> container so this really is a question about what dependencies must be 
> installed to get these gems to work on Focal?
> So far I have specified the following in Heroku's `Aptfile`: 
> {code:java}
> # Get Heroku's Ubuntu distro for your Stack.  Heroku-20 = focal
> # Running bash on ⬢ ... up, run.1471 (Hobby)
> # ~ $ lsb_release --codename --short
> :repo:deb [trusted=yes arch=amd64] https://apache.jfrog.io/artifactory/arrow/ubuntu/ focal main
> libarrow-dev
> libparquet-dev
> libarrow-glib-dev
> libparquet-glib-dev
> libgirepository-1.0-1
> libgirepository1.0-dev
> libglib2.0-dev
> libglib2.0-0
> gir1.2-glib-2.0
> gobject-introspection
> {code}
> Note: the above contains additional packages that were required by 
> `red-arrow` that WERE NOT SPECIFIED in the Installation guide at 
> [https://arrow.apache.org/install/].
> Despite all my efforts, I am unable to solve this issue:
> {code:java}
> 2021-09-21T23:05:11.469561+00:00 heroku[worker.1]: Process exited with status 
> 1
> 2021-09-21T23:05:11.263179+00:00 app[worker.1]: bundler: failed to load 
> command: sidekiq (/app/vendor/bundle/ruby/2.7.0/bin/sidekiq)
> 2021-09-21T23:05:11.263465+00:00 app[worker.1]: 
> /app/vendor/bundle/ruby/2.7.0/gems/zeitwerk-2.4.2/lib/zeitwerk/kernel.rb:34:in
>  `require': 
> /tmp/build_29fd2902/vendor/bundle/ruby/2.7.0/gems/extpp-0.0.9/ext/extpp/libruby-extpp.so:
>  cannot open shared object file: No such file or directory - 
> /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow.so (LoadError)
> 2021-09-21T23:05:11.263508+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/zeitwerk-2.4.2/lib/zeitwerk/kernel.rb:34:in
>  `require'
> 2021-09-21T23:05:11.263521+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow/loader.rb:112:in 
> `require_extension_library'
> 2021-09-21T23:05:11.263532+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib/arrow/loader.rb:31:in 
> `post_load'
> 2021-09-21T23:05:11.263544+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/gobject-introspection-3.4.4/lib/gobject-introspection/loader.rb:45:in
>  `load'
> 2021-09-21T23:05:11.263565+00:00 app[worker.1]: from 
> /app/vendor/bundle/ruby/2.7.0/gems/gobject-introspection-3.4.4/lib/gobject-introspection/loader.rb:25:in
>  `load'
> {code}
>  What is super frustrating is that the directory, 
> `/app/vendor/bundle/ruby/2.7.0/gems/red-arrow-4.0.0/lib`, is specified in 
> `LD_LIBRARY_PATH`, so I'm not sure why it's not being found.
> *+_Any help determining the full list of dependent packages for Ubuntu 20.04 
> (focal) would be greatly appreciated._+*  
>  
> *Extra environment details:*
>  
> Ruby 2.7.4 on Ubuntu 20.04/Heroku
>  
> *Relevant gem versions:*
> red-arrow (4.0.0)
> red-parquet (4.0.0)
> gio2 (3.4.4)
> gobject-introspection (3.4.4)
>  
>  
>  





[jira] [Commented] (ARROW-8379) [R] Investigate/fix thread safety issues (esp. Windows)

2021-09-28 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421606#comment-17421606
 ] 

Jonathan Keane commented on ARROW-8379:
---

Overall this sounds good and in line with what I've been experiencing. A few 
comments on specific points (in no particular order):

> just decreases the frequency of deadlock.

I suspect this is true. With my work on DuckDB I had a similar experience, 
where fewer threads meant fewer opportunities to deadlock (rather than the 
problem being fixed at some thread count greater than 1). 

> On 32-bit Rtools 3.5 disabling datasets

I don't have a better solution, but part of me wonders if there's much point 
in installing Arrow if one doesn't have datasets enabled (this would include 
basically any dplyr on tables as well, right?). Maybe it's time to end support 
for this (though that's earlier than I think is typical / we strive for!). At 
least, maybe we should include a message that says something like "32bit 
windows 3.5 has substantially reduced functionality, please consider upgrading 
your R to take full advantage of..."






[jira] [Commented] (ARROW-14076) Unable to use `red-arrow` gem on Heroku/Ubuntu 20.04 (focal)

2021-09-28 Thread Daniel Rice (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421601#comment-17421601
 ] 

Daniel Rice commented on ARROW-14076:
-

I will try that.






[jira] [Resolved] (ARROW-14107) [R][CI] Parallelize Windows CI jobs

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-14107.
-
Resolution: Fixed

Issue resolved by pull request 11221
[https://github.com/apache/arrow/pull/11221]

> [R][CI] Parallelize Windows CI jobs
> ---
>
> Key: ARROW-14107
> URL: https://issues.apache.org/jira/browse/ARROW-14107
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We've been having some timeouts lately. The build is long because it builds 
> everything 2 or 3 times, and some of that could be parallelized with a 
> different GHA workflow.





[jira] [Resolved] (ARROW-14141) [IR] [C++] Join missing from RelationImpl

2021-09-28 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-14141.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11242
[https://github.com/apache/arrow/pull/11242]

> [IR] [C++] Join missing from RelationImpl
> -
>
> Key: ARROW-14141
> URL: https://issues.apache.org/jira/browse/ARROW-14141
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Compute IR
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{Join}} is missing from {{RelationImpl}} per 
> https://github.com/duckdb/duckdb/pull/2331.





[jira] [Commented] (ARROW-12763) [R] Optimize dplyr queries that use head/tail after arrange

2021-09-28 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421598#comment-17421598
 ] 

Neal Richardson commented on ARROW-12763:
-

See discussion on ARROW-13767. I think this means we should use the top_k node 
when that is added (ARROW-13973)

> [R] Optimize dplyr queries that use head/tail after arrange
> ---
>
> Key: ARROW-12763
> URL: https://issues.apache.org/jira/browse/ARROW-12763
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>
> Use the Arrow C++ function {{partition_nth_indices}} to optimize dplyr 
> queries like this:
> {code:r}
> iris %>%
>   Table$create() %>% 
>   arrange(desc(Sepal.Length)) %>%
>   head(10) %>%
>   collect()
> {code}
> This query sorts the full table even though it doesn't need to. It could use 
> {{partition_nth_indices}} to find the rows containing the top 10 values of 
> {{Sepal.Length}} and only collect and sort those 10 rows.
> Test to see if this improves performance in practice on larger data.
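
The idea of selecting the top rows before sorting can be sketched in plain Python. Note that this uses heap-based selection from the standard library, not the partition-based approach of `partition_nth_indices`, and the function name is an assumption. It only shows that the top `k` indices can be found without sorting the full column.

```python
import heapq


def top_k_indices(values, k):
    """Indices of the k largest values, in descending order of value."""
    return heapq.nlargest(k, range(len(values)), key=values.__getitem__)
```

For `n` rows this does roughly `n log k` comparisons, compared with `n log n` for a full sort, so the saving grows with the table size when `k` is small.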





[jira] [Commented] (ARROW-8379) [R] Investigate/fix thread safety issues (esp. Windows)

2021-09-28 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421595#comment-17421595
 ] 

Neal Richardson commented on ARROW-8379:


Surveying the open issues (now linked here), it looks like:

* On 32-bit Rtools 3.5, async multithreading just does not work. This means we 
should disable the dataset features entirely on this build. It does not appear 
that we can conditionally disable ARROW_DATASET only on this build, based on 
how configure.win works, so we should check this in the R code. Make 
arrow_with_dataset() check os, R version, and arch, which will get tests and 
examples to skip appropriately, and then we could check that inside 
Dataset$create() and error informatively to prevent users on this setup from 
trying. 
* Multithreaded conversion from Arrow to R is prone to issues on Windows across 
the board, regardless of rtools version or 32/64bits. We should set 
option.use_threads = FALSE on Windows in .onLoad. 
* This will have the side effect of disabling multithreading in some other 
parts of C++ code where use_threads is an option. It is not clear that that is 
strictly required, but it will be a side effect unless we distinguish the 
use_threads controls we expose.
* There may be more work to be done with CPU and IO threadpools, which get used 
internally in Arrow C++, but I think it might be best to release with these 
fixes and see if we still get error reports. 
* Relatedly, an alternative to setting use_threads = FALSE globally would 
be to leave multithreading on but reduce the size of the CPU and IO 
threadpools; some reports suggest that setting them to less than the number of 
CPUs allows them to work. It's not clear, though, whether this fixes the problem 
or just decreases the frequency of deadlock. 
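
The first bullet's "check os, R version, and arch, then error informatively" idea can be sketched as follows. This is Python standing in for the R code, with hypothetical function names. Treating "32-bit Rtools 3.5" as "Windows, i386, R older than 4.0" is an assumption about how the affected setup would be detected.

```python
def dataset_supported(os_name, r_version, arch):
    """Return False for the one setup the comment proposes to exclude."""
    old_32bit_windows = (
        os_name == "windows" and arch == "i386" and r_version < (4, 0)
    )
    return not old_32bit_windows


def create_dataset(os_name, r_version, arch):
    """Fail early with an informative message instead of deadlocking later."""
    if not dataset_supported(os_name, r_version, arch):
        raise RuntimeError(
            "Datasets are not supported on 32-bit Windows with R < 4.0; "
            "please consider upgrading R."
        )
    return "dataset"  # placeholder for the real object
```

Keeping the check in one predicate means tests, examples, and the constructor all agree on which setups are skipped.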






[jira] [Resolved] (ARROW-13809) [C ABI] Add support for Month, Day, Nanosecond interval type to C-ABI

2021-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13809.

Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11177
[https://github.com/apache/arrow/pull/11177]

> [C ABI] Add support for Month, Day, Nanosecond interval type to C-ABI
> -
>
> Key: ARROW-13809
> URL: https://issues.apache.org/jira/browse/ARROW-13809
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Now that [https://github.com/apache/arrow/pull/10177] has been merged, we 
> should support transport of the new type via the C ABI bindings.





[jira] [Resolved] (ARROW-13876) [C++] Uniform null handling in compute functions

2021-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13876.

Resolution: Fixed

Issue resolved by pull request 11255
[https://github.com/apache/arrow/pull/11255]

> [C++] Uniform null handling in compute functions
> 
>
> Key: ARROW-13876
> URL: https://issues.apache.org/jira/browse/ARROW-13876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available, types
> Fix For: 6.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The compute functions today have mixed support for null types.
> Unary arithmetic functions (e.g. abs) don't support null arrays.
> Binary arithmetic functions (e.g. add) support one null array (e.g. int32 + 
> null) but not two null arrays (i.e. null + null). They do support both 
> values being null (e.g. [null] + [null] = [null] if dtype=int32, but this is 
> not supported if dtype=null).
> sort_indices should support null arrays.
> Some functions do forward null arrays:
>  - unique
> Some functions output a non-null type given null inputs
> - is_null (=> boolean)
> - is_valid (=> boolean)
> - value_counts (=> struct)
> - dictionary_encode (=> dictionary)
> - count (=> int64)
> Some functions throw an error other than "not implemented"
>  - list_parent_indices
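
The two behaviors described above, forwarding nulls and producing a non-null output type, can be sketched in plain Python with `None` standing in for null. These helpers are illustrative only and are not Arrow's kernels, although the names mirror the functions mentioned in the issue.

```python
def add_arrays(left, right):
    """Element-wise add where None stands for null; null in, null out."""
    if len(left) != len(right):
        raise ValueError("length mismatch")
    return [None if a is None or b is None else a + b for a, b in zip(left, right)]


def is_null(values):
    """Always returns booleans, even for an all-null input."""
    return [v is None for v in values]
```

Uniform handling would mean the all-null case falls out of the same rule as the mixed case, not that it needs a separate code path.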





[jira] [Updated] (ARROW-10898) [C++] Investigate Table sort performance

2021-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10898:
---
Labels: pull-request-available  (was: )

> [C++] Investigate Table sort performance
> 
>
> Key: ARROW-10898
> URL: https://issues.apache.org/jira/browse/ARROW-10898
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As a followup to ARROW-10796, it may be desirable to reimplement Table 
> sorting so as to first sort individual batches, then merge sort them together.
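
The proposed strategy, sorting each batch and then merging the sorted runs, can be sketched with the standard library. This is a plain-Python illustration with an assumed function name, not the Arrow implementation, and the batches are lists of comparable values where Arrow would use record batches.

```python
import heapq


def sort_table(batches, key=None):
    """Sort each batch independently, then k-way merge the sorted runs."""
    sorted_runs = [sorted(batch, key=key) for batch in batches]
    return list(heapq.merge(*sorted_runs, key=key))
```

The per-batch sorts are independent of each other, so they could run in parallel before the final merge.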





[jira] [Updated] (ARROW-10898) [C++] Investigate Table sort performance

2021-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10898:
---
Fix Version/s: 6.0.0

> [C++] Investigate Table sort performance
> 
>
> Key: ARROW-10898
> URL: https://issues.apache.org/jira/browse/ARROW-10898
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 6.0.0
>
>
> As a followup to ARROW-10796, it may be desirable to reimplement Table 
> sorting so as to first sort individual batches, then merge sort them together.





[jira] [Resolved] (ARROW-13618) [R] Use Arrow engine for summarize() by default

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13618.
-
Resolution: Done

All linked tasks have been completed :tada:

> [R] Use Arrow engine for summarize() by default  
> -
>
> Key: ARROW-13618
> URL: https://issues.apache.org/jira/browse/ARROW-13618
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Critical
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> ARROW-13344 enabled the dplyr verb {{summarise()}} to use the Arrow engine 
> but kept this off by default, controlled by the {{arrow.debug}} option.
> Before this can be turned on by default, we should ensure that the following 
> are all implemented:
>  * a sufficient set of hash aggregate kernels and R aggregate function 
> mappings to them, covering the vast majority of all aggregate functions that 
> dplyr users call in {{summarise()}} (add any additional required ones to 
> ARROW-13339)
>  * support for a sufficient set of data types in aggregates
>  * support for a sufficient set of data types in grouping columns
>  * handling of {{NA}} and {{NaN}} values in aggregates and the {{na.rm}} 
> option consistent with base R and dplyr (ARROW-13497 and possibly other 
> issues)
>  * handling of {{NA}} and {{NaN}} values in grouping columns consistent with 
> dplyr
>  * handling empty or bad input to {{summarise()}} (ARROW-13543)
>  * many new tests to confirm equivalent results from a variety of 
> {{group_by() %>% summarise()}} queries on data frames and on Arrow data
>  * resolution of various related bugs





[jira] [Comment Edited] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread Benson Muite (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421580#comment-17421580
 ] 

Benson Muite edited comment on ARROW-14152 at 9/28/21, 6:01 PM:


Current default compiler suite on Ubuntu 21.04 is gcc 10, which is what is used 
in the CI tests. Tried this as below:

 yum update
 yum install gcc-c++ gcc bison flex git python3 make openssl-devel wget
 yum groupinstall "Development tools"

 yum install centos-release-scl
 yum install yum-config-manager
 yum install devtoolset-10
 scl enable devtoolset-10 bash

 wget https://github.com/Kitware/CMake/releases/download/v3.21.3/cmake-3.21.3.tar.gz
 tar -xvf cmake-3.21.3.tar.gz
 cd cmake-3.21.3
 mkdir build
 cd build/
 ../bootstrap --prefix=$HOME/cmake-3.21.3-install
 make
 make install
 cd ..
 cd ..

 git clone https://github.com/apache/arrow.git
 cd arrow
 git submodule init
 git submodule update
 export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
 export ARROW_TEST_DATA="${PWD}/testing/data"
 mkdir cpp/build
 cd cpp/build
 $HOME/cmake-3.21.3-install/bin/cmake .. -DARROW_PARQUET=ON -DARROW_COMPUTE=ON \
 -DARROW_CSV=ON -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON \
 -DThrift_SOURCE=BUNDLED -DPARQUET_REQUIRE_ENCRYPTION=ON

 export PATH=$HOME/cmake-3.21.3-install/bin:$PATH
 make
 make unittest

After running make unittest, the following tests fail:

The following tests FAILED:
     36 - parquet-internals-test (Failed)
     37 - parquet-reader-test (Failed)
     39 - parquet-arrow-test (Failed)
     41 - parquet-encryption-test (Failed)
     42 - parquet-encryption-key-management-test (Failed)

Checking why this occurs.


was (Author: baksmj):
Current default compiler suite on Ubuntu 21.04 is gcc 10, which is what is used 
in the CI tests. Tried this as below:

 yum update
 yum install gcc-c++ gcc bison flex git python3 make openssl-devel wget
 yum groupinstall "Development tools"

 yum install centos-release-scl
 yum install yum-config-manager
 yum install devtoolset-10
 scl enable devtoolset-10 bash

 wget https://github.com/Kitware/CMake/releases/download/v3.21.3/cmake-3.21.3.tar.gz
 tar -xvf cmake-3.21.3.tar.gz
 cd cmake-3.21.3
 mkdir build
 cd build/
 ../bootstrap --prefix=$HOME/cmake-3.21.3-install
 make
 make install
 cd ..
 cd ..

 git clone https://github.com/apache/arrow.git
 cd arrow
 git submodule init
 git submodule update
 export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
 export ARROW_TEST_DATA="${PWD}/testing/data"
 mkdir cpp/build
 cd cpp/build
 $HOME/cmake-3.21.3-install/bin/cmake .. -DARROW_PARQUET=ON -DARROW_COMPUTE=ON \
 -DARROW_CSV=ON -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON \
 -DThrift_SOURCE=BUNDLED -DPARQUET_REQUIRE_ENCRYPTION=ON

 export PATH=$HOME/cmake-3.21.3-install/bin:$PATH
 make
 make unittest

After running make unittest, the following tests fail:

The following tests FAILED:
     36 - parquet-internals-test (Failed)
     37 - parquet-reader-test (Failed)
     39 - parquet-arrow-test (Failed)
     41 - parquet-encryption-test (Failed)
     42 - parquet-encryption-key-management-test (Failed)

Checking why this occurs.

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
> Attachments: CentOS7-gcc4.8.5-CMakeCache.txt, 
> CentOS7-gcc4.8.5-makelog.out
>
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.





[jira] [Commented] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread Benson Muite (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421580#comment-17421580
 ] 

Benson Muite commented on ARROW-14152:
--

Current default compiler suite on Ubuntu 21.04 is gcc 10, which is what is used 
in the CI tests. Tried this as below:

 yum update
 yum install gcc-c++ gcc bison flex git python3 make openssl-devel wget
 yum groupinstall "Development tools"

 yum install centos-release-scl
 yum install yum-config-manager
 yum install devtoolset-10
 scl enable devtoolset-10 bash

 wget https://github.com/Kitware/CMake/releases/download/v3.21.3/cmake-3.21.3.tar.gz
 tar -xvf cmake-3.21.3.tar.gz
 cd cmake-3.21.3
 mkdir build
 cd build/
 ../bootstrap --prefix=$HOME/cmake-3.21.3-install
 make
 make install
 cd ..
 cd ..

 git clone https://github.com/apache/arrow.git
 cd arrow
 git submodule init
 git submodule update
 export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
 export ARROW_TEST_DATA="${PWD}/testing/data"
 mkdir cpp/build
 cd cpp/build
 $HOME/cmake-3.21.3-install/bin/cmake .. -DARROW_PARQUET=ON -DARROW_COMPUTE=ON \
 -DARROW_CSV=ON -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON \
 -DThrift_SOURCE=BUNDLED -DPARQUET_REQUIRE_ENCRYPTION=ON

 export PATH=$HOME/cmake-3.21.3-install/bin:$PATH
 make
 make unittest

After running make unittest, the following tests fail:

The following tests FAILED:
     36 - parquet-internals-test (Failed)
     37 - parquet-reader-test (Failed)
     39 - parquet-arrow-test (Failed)
     41 - parquet-encryption-test (Failed)
     42 - parquet-encryption-key-management-test (Failed)

Checking why this occurs.

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
> Attachments: CentOS7-gcc4.8.5-CMakeCache.txt, 
> CentOS7-gcc4.8.5-makelog.out
>
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.





[jira] [Updated] (ARROW-14157) [C++] Refactor Abseil build in ThirdpartyToolchain

2021-09-28 Thread Carlos O'Ryan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos O'Ryan updated ARROW-14157:
--
Component/s: C++

> [C++] Refactor Abseil build in ThirdpartyToolchain
> --
>
> Key: ARROW-14157
> URL: https://issues.apache.org/jira/browse/ARROW-14157
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Both google-cloud-cpp and gRPC depend on Abseil.  We need to refactor the 
> Abseil build to its own macro so it can be more easily reused.





[jira] [Updated] (ARROW-14157) [C++] Refactor Abseil build in ThirdpartyToolchain

2021-09-28 Thread Carlos O'Ryan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos O'Ryan updated ARROW-14157:
--
Parent: (was: ARROW-1231)
Issue Type: Improvement  (was: Sub-task)

> [C++] Refactor Abseil build in ThirdpartyToolchain
> --
>
> Key: ARROW-14157
> URL: https://issues.apache.org/jira/browse/ARROW-14157
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Both google-cloud-cpp and gRPC depend on Abseil.  We need to refactor the 
> Abseil build to its own macro so it can be more easily reused.





[jira] [Updated] (ARROW-14157) [C++] Refactor Abseil build in ThirdpartyToolchain

2021-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14157:
---
Labels: pull-request-available  (was: )

> [C++] Refactor Abseil build in ThirdpartyToolchain
> --
>
> Key: ARROW-14157
> URL: https://issues.apache.org/jira/browse/ARROW-14157
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Both google-cloud-cpp and gRPC depend on Abseil.  We need to refactor the 
> Abseil build to its own macro so it can be more easily reused.





[jira] [Closed] (ARROW-12183) [R] [CI] Clean up Windows CI jobs

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-12183.
---
  Assignee: (was: Jonathan Keane)
Resolution: Won't Fix

> [R] [CI] Clean up Windows CI jobs
> -
>
> Key: ARROW-12183
> URL: https://issues.apache.org/jira/browse/ARROW-12183
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Priority: Minor
>
> If we are going to keep ARROW-12143, we should remove 
> https://github.com/apache/arrow/blob/master/.github/workflows/r.yml#L178-L186 
> since it will always be true.





[jira] [Updated] (ARROW-14155) [Go] Add functions for creating fingerprints/hashes of data types and scalars

2021-09-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14155:
---
Labels: pull-request-available  (was: )

> [Go] Add functions for creating fingerprints/hashes of data types and scalars
> -
>
> Key: ARROW-14155
> URL: https://issues.apache.org/jira/browse/ARROW-14155
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Necessary for compute and dataset APIs for field references and so on.





[jira] [Comment Edited] (ARROW-13853) [R] String to_title, to_lower, to_upper kernels

2021-09-28 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421552#comment-17421552
 ] 

Eduardo Ponce edited comment on ARROW-13853 at 9/28/21, 5:26 PM:
-

If (for now) we are not going to accept a locale option but still want to match 
the *stringr* API, I can think of these approaches:
 * *str_to_lower(x, locale = "en")* -> accept only the "en" locale and error 
otherwise
 * *str_to_lower(x, locale = NULL)* -> accept only a NULL locale (i.e., not 
supported) and error if another value is given
 * *str_to_lower(x, locale)* -> use R's *missing()* function to detect the 
locale arg and error out if it is provided
 * *str_to_lower(x )* -> non-matching API

cc [~npr] Any suggestions?


was (Author: edponce):
If (for now) we are not going to accept a locale option but still want to match 
the *stringr* API, I can think of these approaches:
 * *str_to_lower(x, locale = "en")* -> accept only the "en" locale and error 
otherwise
 * *str_to_lower(x, locale = NULL)* -> accept only a NULL locale (i.e., not 
supported) and error if another value is given
 * *str_to_lower(x, locale)* -> use R's *missing()* function to detect the 
locale arg and error out if it is provided
 * *str_to_lower(x)* -> non-matching API

cc [~npr] Any suggestions?

> [R] String to_title, to_lower, to_upper kernels
> ---
>
> Key: ARROW-13853
> URL: https://issues.apache.org/jira/browse/ARROW-13853
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-12714 added the *str_to_title* kernel and a basic mapping, but we 
> should add a test. Also the stringr function takes a "locale" argument which 
> is not handled here; we should either pass it to Arrow C++ if it supports it 
> (which I doubt) or error if a value is provided in R.
> This also applies to *str_to_lower* and *str_to_upper* kernels.





[jira] [Comment Edited] (ARROW-13853) [R] String to_title, to_lower, to_upper kernels

2021-09-28 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421552#comment-17421552
 ] 

Eduardo Ponce edited comment on ARROW-13853 at 9/28/21, 5:26 PM:
-

If (for now) we are not going to accept a locale option but still want to match 
the *stringr* API, I can think of these approaches:
 * *str_to_lower(x, locale = "en")* -> accept only the "en" locale and error 
otherwise
 * *str_to_lower(x, locale = NULL)* -> accept only a NULL locale (i.e., not 
supported) and error if another value is given
 * *str_to_lower(x, locale)* -> use R's *missing()* function to detect the 
locale arg and error out if it is provided
 * *str_to_lower(x)* -> non-matching API

cc [~npr] Any suggestions?


was (Author: edponce):
If (for now) we are not going to accept a locale option but still want to match 
the *stringr* API, I can think of these approaches:
 * *str_to_lower(x, locale = "en")* -> accept only the "en" locale and error 
otherwise
 * *str_to_lower(x, locale = NULL)* -> accept only a NULL locale (i.e., not 
supported) and error if another value is given
 * *str_to_lower(x, locale)* -> use R's *missing()* function to detect the 
locale arg and error out if it is provided
 * *str_to_lower(x)* -> non-matching API

cc [~npr] Any suggestions?

> [R] String to_title, to_lower, to_upper kernels
> ---
>
> Key: ARROW-13853
> URL: https://issues.apache.org/jira/browse/ARROW-13853
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-12714 added the *str_to_title* kernel and a basic mapping, but we 
> should add a test. Also the stringr function takes a "locale" argument which 
> is not handled here; we should either pass it to Arrow C++ if it supports it 
> (which I doubt) or error if a value is provided in R.
> This also applies to *str_to_lower* and *str_to_upper* kernels.





[jira] [Commented] (ARROW-13853) [R] String to_title, to_lower, to_upper kernels

2021-09-28 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421552#comment-17421552
 ] 

Eduardo Ponce commented on ARROW-13853:
---

If (for now) we are not going to accept a locale option but still want to match 
the *stringr* API, I can think of these approaches:
 * *str_to_lower(x, locale = "en")* -> accept only the "en" locale and error 
otherwise
 * *str_to_lower(x, locale = NULL)* -> accept only a NULL locale (i.e., not 
supported) and error if another value is given
 * *str_to_lower(x, locale)* -> use R's *missing()* function to detect the 
locale arg and error out if it is provided
 * *str_to_lower(x)* -> non-matching API
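
The third option relies on R's *missing()*. As a rough cross-language illustration only, the "error if a locale is supplied" behavior could look like the Python sketch below. Python has no *missing()*, so a sentinel default stands in for it; the function body and error message are hypothetical and are not the arrow R package's code.

```python
_MISSING = object()  # sentinel: distinguishes "argument omitted" from any real value


def str_to_lower(x, locale=_MISSING):
    """Lowercase x; reject any explicitly supplied locale (hypothetical behavior)."""
    if locale is not _MISSING:
        # Mirrors the "error out if locale is provided" option listed above.
        raise ValueError("providing a value for 'locale' is not supported")
    return x.lower()


print(str_to_lower("ARROW"))
```

Unlike a default of NULL or "en", this variant also rejects callers who pass the default value explicitly, which is the practical difference between the second and third options.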

cc [~npr] Any suggestions?

> [R] String to_title, to_lower, to_upper kernels
> ---
>
> Key: ARROW-13853
> URL: https://issues.apache.org/jira/browse/ARROW-13853
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-12714 added the *str_to_title* kernel and a basic mapping, but we 
> should add a test. Also the stringr function takes a "locale" argument which 
> is not handled here; we should either pass it to Arrow C++ if it supports it 
> (which I doubt) or error if a value is provided in R.
> This also applies to *str_to_lower* and *str_to_upper* kernels.





[jira] [Created] (ARROW-14157) [C++] Refactor Abseil build in ThirdpartyToolchain

2021-09-28 Thread Carlos O'Ryan (Jira)
Carlos O'Ryan created ARROW-14157:
-

 Summary: [C++] Refactor Abseil build in ThirdpartyToolchain
 Key: ARROW-14157
 URL: https://issues.apache.org/jira/browse/ARROW-14157
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Carlos O'Ryan
Assignee: Carlos O'Ryan


Both google-cloud-cpp and gRPC depend on Abseil.  We need to refactor the 
Abseil build to its own macro so it can be more easily reused.






[jira] [Assigned] (ARROW-10898) [C++] Investigate Table sort performance

2021-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-10898:
--

Assignee: Antoine Pitrou

> [C++] Investigate Table sort performance
> 
>
> Key: ARROW-10898
> URL: https://issues.apache.org/jira/browse/ARROW-10898
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> As a followup to ARROW-10796, it may be desirable to reimplement Table 
> sorting so as to first sort individual batches, then merge sort them together.





[jira] [Updated] (ARROW-14156) [C++] StructArray::Flatten is incorrect in some cases

2021-09-28 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-14156:

Summary: [C++] StructArray::Flatten is incorrect in some cases  (was: 
StructArray::Flatten is incorrect in some cases)

> [C++] StructArray::Flatten is incorrect in some cases
> -
>
> Key: ARROW-14156
> URL: https://issues.apache.org/jira/browse/ARROW-14156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: Micah Kornfield
>Priority: Major
>
> When trying to flatten a struct that has children that were sliced we see 
> incorrect results.
>  
> {code:title=Bar.java|borderStyle=solid}
> import pyarrow as pa
> a = pa.array([1,2,3])
> sliceds = a.slice(1)
> composed_struct = pa.StructArray.from_buffers(pa.struct([pa.field("a", 
> sliceds.type)]), len(sliceds), [pa.array([True, False]).buffers()[1]], 
> children=[sliceds])
> >>> composed_struct
> 
> -- is_valid:
>   [
>     true,
>     false
>   ]
> -- child 0 type: int64
>   [
>     2,
>     3
>   ]
> >>> composed_struct.flatten()
> [
> [
>   null,
>   null
> ]]
> {code}
>  
> I believe the problem is 
> [here|https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/array/array_nested.cc#L572]
>  the copy does not account for child array offset.





[jira] [Created] (ARROW-14156) StructArray::Flatten is incorrect in some cases

2021-09-28 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-14156:
---

 Summary: StructArray::Flatten is incorrect in some cases
 Key: ARROW-14156
 URL: https://issues.apache.org/jira/browse/ARROW-14156
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 5.0.0
Reporter: Micah Kornfield


When trying to flatten a struct that has children that were sliced we see 
incorrect results.

 
{code:title=Bar.java|borderStyle=solid}
import pyarrow as pa

a = pa.array([1,2,3])

sliceds = a.slice(1)

composed_struct = pa.StructArray.from_buffers(pa.struct([pa.field("a", 
sliceds.type)]), len(sliceds), [pa.array([True, False]).buffers()[1]], 
children=[sliceds])

>>> composed_struct

-- is_valid:
  [
    true,
    false
  ]

-- child 0 type: int64
  [
    2,
    3
  ]

>>> composed_struct.flatten()
[
[
  null,
  null
]]
{code}
 
I believe the problem is 
[here|https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/array/array_nested.cc#L572]
 the copy does not account for child array offset.
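
To make the suspected cause concrete, here is a toy model in pure Python. It is not Arrow's code: validity bitmaps are modeled as lists of booleans, and flatten_child is a hypothetical helper. It shows how the child's offset has to be applied when the struct's validity is combined with the child's values; for the example above the result would then presumably be [2, null] rather than [null, null].

```python
def flatten_child(child_values, child_valid, child_offset, struct_valid):
    """Return the flattened child: a slot is null if the child OR the struct is null."""
    out = []
    for i in range(len(struct_valid)):
        j = child_offset + i  # position in the child's underlying buffers
        ok = struct_valid[i] and child_valid[j]
        out.append(child_values[j] if ok else None)
    return out


# The child's buffers still hold [1, 2, 3]; slicing by 1 only sets offset=1.
values = [1, 2, 3]
valid = [True, True, True]
print(flatten_child(values, valid, 1, [True, False]))
```

Reading the child's buffers at index i instead of child_offset + i is the kind of mistake the report describes.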





[jira] [Commented] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421506#comment-17421506
 ] 

David Li commented on ARROW-14152:
--

Ah - just for future reference those .dev tags, AFAIK, are just artifacts of a 
CI pipeline and aren't special in any way. The actual release tags are tested, 
but the testing is mostly the same as the nightly CI + some manual verification 
for the release vote. (Hence, for old CentOS, there's no difference unless 
someone decides to bring up an old CentOS - perhaps we should just add it to 
nightly CI.)

For the googletest issue, I think CMake should dump a log with more information.

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
> Attachments: CentOS7-gcc4.8.5-CMakeCache.txt, 
> CentOS7-gcc4.8.5-makelog.out
>
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.





[jira] [Commented] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread Benson Muite (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421504#comment-17421504
 ] 

Benson Muite commented on ARROW-14152:
--

Thanks for the feedback. My assumption is that the tagged releases are more 
thoroughly tested, so should build more readily. Tried a build after installing 
GCC 8 ( [https://www.softwarecollections.org/en/scls/rhscl/devtoolset-8/] ) as 
suggested on the mailing list. A Release build without tests compiles for commit 
c20b377008901c3b31c2d8c4c809e176a258e86b. A Debug build does not; there is some 
issue with googletest:
[  8%] Performing build step for 'googletest_ep'
CMake Error at 
/root/arrow/cpp/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-build-DEBUG.cmake:37 (message):
  Command failed: 2
 


 

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
> Attachments: CentOS7-gcc4.8.5-CMakeCache.txt, 
> CentOS7-gcc4.8.5-makelog.out
>
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.





[jira] [Created] (ARROW-14155) [Go] Add functions for creating fingerprints/hashes of data types and scalars

2021-09-28 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-14155:
-

 Summary: [Go] Add functions for creating fingerprints/hashes of 
data types and scalars
 Key: ARROW-14155
 URL: https://issues.apache.org/jira/browse/ARROW-14155
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Go
Reporter: Matthew Topol
Assignee: Matthew Topol


Necessary for compute and dataset APIs for field references and so on.





[jira] [Commented] (ARROW-14035) [C++][Compute] Implement non-hash count_distinct aggregate kernel

2021-09-28 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-14035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421478#comment-17421478
 ] 

Percy Camilo Triveño Aucahuasi commented on ARROW-14035:


Draft PR https://github.com/apache/arrow/pull/11257

> [C++][Compute] Implement non-hash count_distinct aggregate kernel
> -
>
> Key: ARROW-14035
> URL: https://issues.apache.org/jira/browse/ARROW-14035
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Critical
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> ARROW-12728 added a {{hash_count_distinct}} hash aggregate kernel, but there 
> is no non-hash {{count_distinct}} aggregate kernel.
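
For readers unfamiliar with the distinction: a hash aggregate kernel produces one result per group, while a non-hash (scalar) aggregate reduces a whole column to a single value. The pure-Python stand-ins below only illustrate that difference. They reuse the kernel names for readability, skip nulls by assumption, and say nothing about how the C++ kernels are implemented.

```python
from collections import defaultdict


def count_distinct(values):
    """Scalar aggregate: one result for the whole column (nulls skipped by assumption)."""
    return len({v for v in values if v is not None})


def hash_count_distinct(keys, values):
    """Grouped aggregate: one result per distinct key."""
    seen = defaultdict(set)
    for k, v in zip(keys, values):
        seen[k]  # make sure the group exists even if all of its values are null
        if v is not None:
            seen[k].add(v)
    return {k: len(s) for k, s in seen.items()}


print(count_distinct([1, 2, 2, None, 3]))
print(hash_count_distinct(["a", "a", "b", "b"], [1, 1, 2, None]))
```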





[jira] [Commented] (ARROW-14135) [Python] Missing Python tests for compute kernels

2021-09-28 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421446#comment-17421446
 ] 

Eduardo Ponce commented on ARROW-14135:
---

[~jorisvandenbossche] In this issue I was referring to kernels that do not have 
*FunctionOptions* or a Python-specific implementation, so these are the ones 
bound automatically. Recently, in ARROW-13327, I consolidated and verified the 
*FunctionOptions*, including tests. But it never hurts to double-check whether 
there are missing tests for kernels with Python options and/or implementation.

> [Python] Missing Python tests for compute kernels
> -
>
> Key: ARROW-14135
> URL: https://issues.apache.org/jira/browse/ARROW-14135
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Eduardo Ponce
>Priority: Major
> Fix For: 6.0.0
>
>
> PyArrow is missing tests for several compute kernels.





[jira] [Updated] (ARROW-13853) [R] String to_title, to_lower, to_upper kernels

2021-09-28 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-13853:
--
Description: 
ARROW-12714 added the *str_to_title* kernel and a basic mapping, but we should 
add a test. Also the stringr function takes a "locale" argument which is not 
handled here; we should either pass it to Arrow C++ if it supports it (which I 
doubt) or error if a value is provided in R.

This also applies to *str_to_lower* and *str_to_upper* kernels.


  was:
ARROW-12714 added the *str_to_title* kernel and a basic mapping, but we should 
add a test. Also the stringr function takes a "locale" argument which is not 
handled here; we should either pass it to Arrow C++ if it supports it (which I 
doubt) or error if a non-default value is provided in R.

This also applies to *str_to_lower* and *str_to_upper* kernels.



> [R] String to_title, to_lower, to_upper kernels
> ---
>
> Key: ARROW-13853
> URL: https://issues.apache.org/jira/browse/ARROW-13853
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-12714 added the *str_to_title* kernel and a basic mapping, but we 
> should add a test. Also the stringr function takes a "locale" argument which 
> is not handled here; we should either pass it to Arrow C++ if it supports it 
> (which I doubt) or error if a value is provided in R.
> This also applies to *str_to_lower* and *str_to_upper* kernels.





[jira] [Updated] (ARROW-6626) [Python] Handle nested "set" values as lists when converting to Arrow

2021-09-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6626:
-
Summary: [Python] Handle nested "set" values as lists when converting to 
Arrow  (was: [Python] Handle "set" values as lists when converting to Arrow)

> [Python] Handle nested "set" values as lists when converting to Arrow
> -
>
> Key: ARROW-6626
> URL: https://issues.apache.org/jira/browse/ARROW-6626
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> See current behavior
> {code}
> In [1]: pa.array([{1,2, 3}])  
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.array([{1,2, 3}])
> ~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
> ~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
> ~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Could not convert {1, 2, 3} with type set: did not recognize 
> Python value type when inferring an Arrow data type
> In ../src/arrow/python/iterators.h, line 70, code: func(value, 
> static_cast(i), _going)
> In ../src/arrow/python/inference.cc, line 621, code: 
> inferrer.VisitSequence(obj, mask)
> In ../src/arrow/python/python_to_arrow.cc, line 1074, code: 
> InferArrowType(seq, mask, options.from_pandas, _type)
> {code}
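
On versions without this change, one possible workaround is to convert sets to lists on the Python side before calling pa.array. A minimal sketch follows. sets_to_lists is a hypothetical helper, not part of pyarrow, and sorting is used only to get a deterministic order, which assumes the set elements are mutually comparable.

```python
def sets_to_lists(obj):
    """Recursively replace set/frozenset values with sorted lists."""
    if isinstance(obj, (set, frozenset)):
        return sorted(sets_to_lists(v) for v in obj)
    if isinstance(obj, (list, tuple)):
        return [sets_to_lists(v) for v in obj]
    if isinstance(obj, dict):
        return {k: sets_to_lists(v) for k, v in obj.items()}
    return obj  # scalars and anything else pass through unchanged


print(sets_to_lists([{1, 2, 3}]))
```

The converted value can then be passed to pa.array in place of the original input.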





[jira] [Resolved] (ARROW-6626) [Python] Handle nested "set" values as lists when converting to Arrow

2021-09-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-6626.
--
Resolution: Fixed

Issue resolved by pull request 11076
[https://github.com/apache/arrow/pull/11076]

> [Python] Handle nested "set" values as lists when converting to Arrow
> -
>
> Key: ARROW-6626
> URL: https://issues.apache.org/jira/browse/ARROW-6626
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> See current behavior
> {code}
> In [1]: pa.array([{1,2, 3}])  
>  
> ---
> ArrowInvalid  Traceback (most recent call last)
>  in 
> > 1 pa.array([{1,2, 3}])
> ~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
> ~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
> ~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Could not convert {1, 2, 3} with type set: did not recognize 
> Python value type when inferring an Arrow data type
> In ../src/arrow/python/iterators.h, line 70, code: func(value, 
> static_cast(i), _going)
> In ../src/arrow/python/inference.cc, line 621, code: 
> inferrer.VisitSequence(obj, mask)
> In ../src/arrow/python/python_to_arrow.cc, line 1074, code: 
> InferArrowType(seq, mask, options.from_pandas, _type)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12563) Add space,add_months and datediff functions for string

2021-09-28 Thread Anthony Louis Gotlib Ferreira (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Louis Gotlib Ferreira reassigned ARROW-12563:
-

Assignee: Anthony Louis Gotlib Ferreira

> Add space,add_months and datediff functions for string
> --
>
> Key: ARROW-12563
> URL: https://issues.apache.org/jira/browse/ARROW-12563
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Dileep
>Assignee: Anthony Louis Gotlib Ferreira
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13975) [C++][Compute] Add decimal support to round functions

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13975:

Priority: Critical  (was: Major)

> [C++][Compute] Add decimal support to round functions
> -
>
> Key: ARROW-13975
> URL: https://issues.apache.org/jira/browse/ARROW-13975
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Eduardo Ponce
>Priority: Critical
>  Labels: kernel
> Fix For: 6.0.0
>
>
> Need to add Decimal support to the rounding compute functions. The PR for 
> adding round compute functions (ARROW-12744) only includes basic arithmetic 
> types (unsigned/signed int and floating-point).
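
To illustrate why decimal inputs need their own rounding path rather than a detour through floating point, here is a small sketch using Python's standard `decimal` module. It is not Arrow's kernel code; `round_decimal` is a hypothetical helper, and the choice of half-to-even as the default mode is an assumption made for the example.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_decimal(value, ndigits, mode=ROUND_HALF_EVEN):
    """Round a Decimal to ndigits fractional digits without using floats."""
    # Decimal(1).scaleb(-2) is Decimal("0.01"), the target quantum.
    quantum = Decimal(1).scaleb(-ndigits)
    return value.quantize(quantum, rounding=mode)


exact = round_decimal(Decimal("2.675"), 2)               # tie resolved on the exact value
half_up = round_decimal(Decimal("2.665"), 2, ROUND_HALF_UP)
via_float = round(2.675, 2)                               # binary float is slightly below 2.675
```

The float path rounds a value that is not exactly 2.675, which is the kind of discrepancy a dedicated decimal kernel avoids.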



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-14139) [IR] [C++] Table flatbuffer object fails to compile on older GCCs

2021-09-28 Thread Phillip Cloud (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud resolved ARROW-14139.
---
Fix Version/s: 6.0.0
   Resolution: Fixed

Fixed by https://github.com/apache/arrow/pull/11241

> [IR] [C++] Table flatbuffer object fails to compile on older GCCs
> -
>
> Key: ARROW-14139
> URL: https://issues.apache.org/jira/browse/ARROW-14139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Compute IR
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> See https://github.com/duckdb/duckdb/pull/2331#issue-1007080354.
> The {{Table}} name in the compute IR fbs conflicts with an internal object 
> name of the flatbuffers library itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-14122) [C++] interval comparison kernels

2021-09-28 Thread Phillip Cloud (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421416#comment-17421416
 ] 

Phillip Cloud edited comment on ARROW-14122 at 9/28/21, 1:56 PM:
-

[~jorgecarleitao] [~westonpace] [~houqp] I personally am +1 on [~westonpace]'s 
idea, since it allows Arrow compute to avoid having to bake in a specific 
behavior to the core Arrow type.

Do we need a new JIRA/set of JIRAs to track the work of adding the extension 
type?


was (Author: cpcloud):
[~jorgecarleitao][~westonpace][~houqp] I personally am +1 on [~westonpace]'s 
idea, since it allows Arrow compute to avoid having to bake in a specific 
behavior to the core Arrow type.

Do we need a new JIRA/set of JIRAs to track the work of adding the extension 
type?

> [C++] interval comparison kernels
> -
>
> Key: ARROW-14122
> URL: https://issues.apache.org/jira/browse/ARROW-14122
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Phillip Cloud
>Priority: Major
>  Labels: kernel
>
> Subtask for tracking interval comparison kernels



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14122) [C++] interval comparison kernels

2021-09-28 Thread Phillip Cloud (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421416#comment-17421416
 ] 

Phillip Cloud commented on ARROW-14122:
---

[~jorgecarleitao][~westonpace][~houqp] I personally am +1 on [~westonpace]'s 
idea, since it allows Arrow compute to avoid having to bake in a specific 
behavior to the core Arrow type.

Do we need a new JIRA/set of JIRAs to track the work of adding the extension 
type?

> [C++] interval comparison kernels
> -
>
> Key: ARROW-14122
> URL: https://issues.apache.org/jira/browse/ARROW-14122
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Phillip Cloud
>Priority: Major
>  Labels: kernel
>
> Subtask for tracking interval comparison kernels



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14154) [Dev] merge_arrow_pr.py script fails if head pointer can't be checked out

2021-09-28 Thread Phillip Cloud (Jira)
Phillip Cloud created ARROW-14154:
-

 Summary: [Dev] merge_arrow_pr.py script fails if head pointer 
can't be checked out
 Key: ARROW-14154
 URL: https://issues.apache.org/jira/browse/ARROW-14154
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Phillip Cloud


{code}
Merge complete (local ref PR_TOOL_MERGE_PR_11241_MASTER). Push to apache? 
(y/n): y
Enumerating objects: 17, done.
Counting objects: 100% (17/17), done.
Delta compression using up to 32 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (9/9), 1.98 KiB | 1.98 MiB/s, done.
Total 9 (delta 7), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (7/7), completed with 7 local objects.
To github.com:apache/arrow.git
   e5dc9d411..689a0b696  PR_TOOL_MERGE_PR_11241_MASTER -> master
Restoring head pointer to 6254bf77
error: pathspec '6254bf77' did not match any file(s) known to git
Command failed: ['git', 'checkout', '6254bf77']
{code}

Why does the script need to do anything that mutates the state of the clone?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13588) [R] Empty character attributes not stored

2021-09-28 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421380#comment-17421380
 ] 

Neal Richardson commented on ARROW-13588:
-

Will resolve in ARROW-12871

> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Assignee: Neal Richardson
>Priority: Critical
>  Labels: attributes, feather
> Fix For: 6.0.0
>
>
> Date-times in the POSIXct format have a 'tzone' attribute that by default is 
> set to "", an empty character vector (not NULL) when created.
> This however is not stored in the Arrow feather file. When the file is read 
> back, the original and restored dataframes are not identical as per the below 
> reprex.
> I am thinking that this should not be the intention? My workaround at the 
> moment is making a check when reading back to write the empty string if the 
> tzone attribute does not exist.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
> arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13588) [R] Empty character attributes not stored

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-13588:
---

Assignee: Neal Richardson

> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Assignee: Neal Richardson
>Priority: Critical
>  Labels: attributes, feather
> Fix For: 6.0.0
>
>
> Date-times in the POSIXct format have a 'tzone' attribute that by default is 
> set to "", an empty character vector (not NULL) when created.
> This however is not stored in the Arrow feather file. When the file is read 
> back, the original and restored dataframes are not identical as per the below 
> reprex.
> I am thinking that this should not be the intention? My workaround at the 
> moment is making a check when reading back to write the empty string if the 
> tzone attribute does not exist.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
> arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-14150) [C++] Skip delimiter checking in CSV chunker if quoting is false

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-14150.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11258
[https://github.com/apache/arrow/pull/11258]

> [C++] Skip delimiter checking in CSV chunker if quoting is false
> 
>
> Key: ARROW-14150
> URL: https://issues.apache.org/jira/browse/ARROW-14150
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Delimiter checking is not necessary for csv chunker if quoting is disabled. 
> Bypassing the delimiter checking improves performance significantly.
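
The idea can be sketched in plain Python. This is a simplified illustration, not the Arrow C++ chunker: `last_row_end` is a hypothetical function, it ignores escape characters and `\r`, and it only shows why less per-byte work is needed once quoting is off.

```python
def last_row_end(block: bytes, quoting: bool = True) -> int:
    """Return the offset just past the last complete row in block, or 0."""
    if not quoting:
        # Every newline ends a row, so no per-byte state machine is needed.
        return block.rfind(b"\n") + 1
    in_quotes = False
    end = 0
    for i, byte in enumerate(block):
        if byte == ord('"'):
            # A doubled quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
        elif byte == ord("\n") and not in_quotes:
            end = i + 1
    return end
```

With quoting enabled, a newline inside a quoted field is not a row boundary, so the block has to be scanned byte by byte; with quoting disabled, a single reverse search is enough.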



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14149) [C++][R] Support a "modified" hive style directory naming scheme

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14149:

Component/s: C++

> [C++][R] Support a "modified" hive style directory naming scheme
> 
>
> Key: ARROW-14149
> URL: https://issues.apache.org/jira/browse/ARROW-14149
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ryan Hafen
>Priority: Minor
>
> I am working on a project where I need to create and analyze parquet files 
> using Apache Arrow but the environment I'm working with does not allow "=" in 
> file paths, which the hive naming convention forces, e.g. "year=2007". While 
> I can specify the partitioning to not use the hive convention, I then lose 
> the variable names. This is problematic when I'm sharing the datasets with 
> others because they will have to specify the partitioning variables when 
> opening the dataset but they don't know what the partitioning variables are.
>  
> Would it be possible to allow a modified hive-style directory naming 
> convention that still preserves the variable name in the directory name? For 
> example, allowing a delimiter other than "="?
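
A rough sketch of what the request amounts to, in plain Python. `parse_partition_path` and its `delimiter` parameter are hypothetical and are not an existing Arrow option; the sketch only shows that the variable names can be recovered from directory names with any agreed separator.

```python
def parse_partition_path(path: str, delimiter: str = "=") -> dict:
    """Extract {name: value} pairs from segments such as 'year=2007'."""
    fields = {}
    for segment in path.strip("/").split("/"):
        name, sep, value = segment.partition(delimiter)
        # Segments without the delimiter (e.g. the file name) are skipped.
        if sep and name:
            fields[name] = value
    return fields


hive = parse_partition_path("year=2007/month=1/part-0.parquet")
modified = parse_partition_path("year~2007/month~1/part-0.parquet", delimiter="~")
```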



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14149) [C++][R] Support a "modified" hive style directory naming scheme

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14149:

Summary: [C++][R] Support a "modified" hive style directory naming scheme  
(was: Support a "modified" hive style directory naming scheme)

> [C++][R] Support a "modified" hive style directory naming scheme
> 
>
> Key: ARROW-14149
> URL: https://issues.apache.org/jira/browse/ARROW-14149
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ryan Hafen
>Priority: Minor
>
> I am working on a project where I need to create and analyze parquet files 
> using Apache Arrow but the environment I'm working with does not allow "=" in 
> file paths, which the hive naming convention forces, e.g. "year=2007". While 
> I can specify the partitioning to not use the hive convention, I then lose 
> the variable names. This is problematic when I'm sharing the datasets with 
> others because they will have to specify the partitioning variables when 
> opening the dataset but they don't know what the partitioning variables are.
>  
> Would it be possible to allow a modified hive-style directory naming 
> convention that still preserves the variable name in the directory name? For 
> example, allowing a delimiter other than "="?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-14124) [R] Timezone support in R <= 3.4

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-14124.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11250
[https://github.com/apache/arrow/pull/11250]

> [R] Timezone support in R <= 3.4
> 
>
> Key: ARROW-14124
> URL: https://issues.apache.org/jira/browse/ARROW-14124
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The error is: 
> https://github.com/ursacomputing/crossbow/runs/3696604088#step:7:15209
> {code}
>  Timezone not present, cannot convert to string with timezone: %Y-%m-%d%z
> {code}
> Ok, I think I’ve got close to the source of the issue (though let me tell you 
> neither the source nor the docs for R are accurate [1] on this…)
> 3.4 (and earlier):
> {code:r}
> > attributes(c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "Etc/GMT+6"), 
> > NA)) 
> $class
> [1] "POSIXct" "POSIXt" 
> {code}
> 3.5 (and later):
> {code:r}
> > attributes(c(lubridate::ymd_hms("2018-10-07 19:04:05", tz = "Etc/GMT+6"), 
> > NA)) 
> $class
> [1] "POSIXct" "POSIXt" 
> $tzone
> [1] "Etc/GMT+6"
> {code}
> So R itself is dropping the {{tzone}} attribute when we use {{c()}}, that is 
> being passed to Arrow as such, and then when Arrow goes to print a timezone 
> it (rightfully!) complains that there is no timezone to be formatted into the 
> string. 
> This behavior actually sounds right (given the inputs), so I propose that we 
> catch the error in R <= 3.4 (or skip the test in R <= 3.4)
> [1] - The documented behavior is current, but it didn't change at the same 
> time as the actual behavior.
> The docs starting in 4.1.0 state:
>  Using \code{\link{c}} on \code{"POSIXlt"} objects converts them to the
>   current time zone, and on \code{"POSIXct"} objects drops any
>   \code{"tzone"} attributes, unless they are all marked with the same
>   time zone.
> https://github.com/wch/r-source/blob/tags/R-4-1-0/src/library/base/man/DateTimeClasses.Rd#L180-L183
> The docs before that state:
>   Using \code{\link{c}} on \code{"POSIXlt"} objects converts them to the
>   current time zone, and on \code{"POSIXct"} objects drops any
>   \code{"tzone"} attributes (even if they are all marked with the same
>   time zone).
> https://github.com/wch/r-source/blob/tags/R-4-0-5/src/library/base/man/DateTimeClasses.Rd



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13927) [R] Add Karl to the contributors list for the package

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13927.
-
Resolution: Fixed

Issue resolved by pull request 11256
[https://github.com/apache/arrow/pull/11256]

> [R] Add Karl to the contributors list for the package
> -
>
> Key: ARROW-13927
> URL: https://issues.apache.org/jira/browse/ARROW-13927
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [~karldw] : As recognition of the contributions you have made, especially the 
> herculean effort with ARROW-12981 we would like to add you to the 
> contributors list of the R package. 
> If you are ok with this, would you mind giving us the email you would like to 
> have listed there (and ORCID, if you have one and want it there as well)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-14140) [R] skip arrow_binary/arrow_large_binary class from R metadata

2021-09-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-14140.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11240
[https://github.com/apache/arrow/pull/11240]

> [R] skip arrow_binary/arrow_large_binary class from R metadata
> --
>
> Key: ARROW-14140
> URL: https://issues.apache.org/jira/browse/ARROW-14140
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The classes `c("arrow_binary", "vctrs_vctr", "list")` and 
> `c("arrow_large_binary", "vctrs_vctr", "list")` should be removed in 
> `arrow_attributes()`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14136) [R] Cannot call allocate_arrow_schema()

2021-09-28 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421361#comment-17421361
 ] 

Neal Richardson commented on ARROW-14136:
-

Yes, it will fail earlier and more clearly (but hopefully won't fail at all!)

> [R] Cannot call allocate_arrow_schema()
> ---
>
> Key: ARROW-14136
> URL: https://issues.apache.org/jira/browse/ARROW-14136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Linux, g++-9.3.0, R-4.1.1
>Reporter: Laurent
>Priority: Major
>
> Calling the (private) function `arrow:::allocate_arrow_schema()` ends with an 
> error:
> {code:r}
> > library(arrow)
> > arrow:::allocate_arrow_schema()
> Error in arrow:::allocate_arrow_schema() : 
>   Cannot call allocate_arrow_schema(). See 
> https://arrow.apache.org/docs/r/articles/install.html for help installing 
> Arrow C++ libraries.
> {code}
> This used to work on the same machine with an earlier version (arrow v4). I 
> also noticed that during the installation of the package there is an error 
> message in the logs:
>  
> {noformat}
> trying URL 'https://stat.ethz.ch/CRAN/src/contrib/arrow_5.0.0.2.tar.gz'
> Content type 'application/x-gzip' length 483642 bytes (472 KB)
> ==
> downloaded 472 KB
> * installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Successfully retrieved C++ source
> *** Building C++ libraries
>  cmake
> gzip: stdin: unexpected end of file
> /usr/bin/tar: Unexpected EOF in archive
> /usr/bin/tar: Unexpected EOF in archive
> /usr/bin/tar: Error is not recoverable: exiting now
>  arrow  
>  Error building Arrow C++. Re-run with ARROW_R_DEV=true for debug 
> information.
> Warning message:
> In untar(cmake_tar, exdir = cmake_dir) :
>   ‘/usr/bin/tar -xf '/tmp/RtmpQM4Z5j/file3f157fe5ee31' -C 
> '/tmp/RtmpQM4Z5j/file3f15457f6d01'’ returned error code 2
> - NOTE ---
> See https://arrow.apache.org/docs/r/articles/install.html
> for help installing Arrow C++ libraries
> -
> ** libs
> g++ -std=gnu++11 -I"/usr/local/packages/R/4.1/lib/R/include" -DNDEBUG 
> -I../inst/include/  -I/usr/local/include   -fpic  -g -O2  -c RTasks.cpp -o 
> RTasks.o
> (...) keeps compiling and reports a successful installation in the end.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421349#comment-17421349
 ] 

David Li commented on ARROW-14152:
--

(FWIW, that 6.0.0.dev tag is significantly out of date - it's two months old)

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
> Attachments: CentOS7-gcc4.8.5-CMakeCache.txt, 
> CentOS7-gcc4.8.5-makelog.out
>
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421347#comment-17421347
 ] 

David Li commented on ARROW-14152:
--

Note the problematic enum.h in this output was removed in a later commit. On 
the mailing list it's mentioned that the current master has some issues; what 
are the issues there?

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
> Attachments: CentOS7-gcc4.8.5-CMakeCache.txt, 
> CentOS7-gcc4.8.5-makelog.out
>
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13572) [C++][Python] Add basic ORC support to the pyarrow.datasets API

2021-09-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-13572.
---
Resolution: Fixed

Issue resolved by pull request 10991
[https://github.com/apache/arrow/pull/10991]

> [C++][Python] Add basic ORC support to the pyarrow.datasets API
> ---
>
> Key: ARROW-13572
> URL: https://issues.apache.org/jira/browse/ARROW-13572
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Rick Zamora
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: orc, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> There is significant interest in having directory-partitioned ORC support 
> from users of Dask.  Since Dask already leverages the pyarrow.datasets API 
> for parquet-formatted data, having ORC support through the same pyarrow API 
> would be extremely useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13474) [C++][Python] PyArrow crash when filter/take empty Extension array

2021-09-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13474.

Resolution: Fixed

Issue resolved by pull request 11227
[https://github.com/apache/arrow/pull/11227]

> [C++][Python] PyArrow crash when filter/take empty Extension array
> --
>
> Key: ARROW-13474
> URL: https://issues.apache.org/jira/browse/ARROW-13474
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 3.0.0, 4.0.0, 4.0.1
> Environment: Python 3.7, Ubuntu 20.04
>Reporter: Paul Balanca
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> PyArrow is crashing when applying `filter` or `take` on already empty 
> extension arrays.
> The bug can be reproduced with the documentation example:
> {code:java}
> import pyarrow as pa
>
> class Point3DArray(pa.ExtensionArray):
>     def to_numpy_array(self):
>         return self.storage.flatten().to_numpy().reshape((-1, 3))
>
> class Point3DType(pa.PyExtensionType):
>     def __init__(self):
>         pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))
>
>     def __reduce__(self):
>         return Point3DType, ()
>
>     def __arrow_ext_class__(self):
>         return Point3DArray
>
> storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
> arr = pa.ExtensionArray.from_storage(Point3DType(), storage)
> arr = arr.filter(pa.array([False, False]))
> # Crashing here...
> arr.filter(pa.array([], pa.bool_()))
> # Crashing as well...
> arr.take(pa.array([], pa.int32()))
> {code}
> The underlying issue seems to be that the function `nulls` is not implemented 
> for extension types in the C++ codebase: 
> [https://github.com/apache/arrow/blob/6db88a9e946c98c59f179210a70bc05ef6a0a296/cpp/src/arrow/array/util.cc#L472]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread Benson Muite (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Muite updated ARROW-14152:
-
Attachment: CentOS7-gcc4.8.5-makelog.out
CentOS7-gcc4.8.5-CMakeCache.txt

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
> Attachments: CentOS7-gcc4.8.5-CMakeCache.txt, 
> CentOS7-gcc4.8.5-makelog.out
>
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14152) [C++][Docs][Parquet] Trouble installing on Cent OS 7

2021-09-28 Thread Benson Muite (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421313#comment-17421313
 ] 

Benson Muite commented on ARROW-14152:
--

Consider adding  [re2|https://github.com/google/re2] as an optional dependency 
for building tests.

> [C++][Docs][Parquet] Trouble installing on Cent OS 7
> 
>
> Key: ARROW-14152
> URL: https://issues.apache.org/jira/browse/ARROW-14152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation, Parquet
> Environment: Cent OS 7
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>
> Installing on Cent OS 7 is not well documented and can be problematic, in 
> particular for debug builds. Create guidelines for this that could also be 
> turned into nightly tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13115) plasma.PlasmaClient do not disconnect when user tried to delete it

2021-09-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13115:
--
Component/s: C++ - Plasma

> plasma.PlasmaClient do not disconnect when user tried to delete it
> --
>
> Key: ARROW-13115
> URL: https://issues.apache.org/jira/browse/ARROW-13115
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma, Python
>Affects Versions: 4.0.0
>Reporter: Yuxian Meng
>Priority: Critical
>
> ```
> import pyarrow.plasma as plasma
> for _ in range(1):
>     c = plasma.connect("/tmp/plasma")
>     del c
> ```
> The above code turns out not to call c.disconnect() automatically, and will 
> cause `Connection to IPC socket failed` error.
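
Until this is addressed, calling disconnect() explicitly avoids relying on `del`. A minimal sketch follows, assuming only that the client object exposes the `disconnect()` method named above; `FakeClient` is a stand-in so the example runs without a plasma store, and `connected` is a hypothetical helper name.

```python
from contextlib import contextmanager


class FakeClient:
    """Stand-in for a plasma client so the sketch runs without a store."""

    def __init__(self):
        self.connected = True

    def disconnect(self):
        self.connected = False


@contextmanager
def connected(connect):
    """Yield a client from connect() and always disconnect it afterwards."""
    client = connect()
    try:
        yield client
    finally:
        client.disconnect()


with connected(FakeClient) as c:
    still_open = c.connected  # True while inside the block
```

With a real store, the same helper would presumably be used as `connected(lambda: plasma.connect("/tmp/plasma"))`.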



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13152) Plasma server hangs on Get requests containing duplicate object IDs

2021-09-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13152:
--
Component/s: C++ - Plasma

> Plasma server hangs on Get requests containing duplicate object IDs
> ---
>
> Key: ARROW-13152
> URL: https://issues.apache.org/jira/browse/ARROW-13152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, C++ - Plasma
>Affects Versions: 4.0.1
> Environment: tested on Linux/Python 3.8/pyarrow 4.0.1
>Reporter: Bruce Martin
>Priority: Major
>
> If a plasma client issues a Get request containing duplicate object IDs, and 
> a timeout of -1, the server will hang.
> The logic at the end of `PlasmaStore::ProcessGetRequest()` only returns a 
> response when the number of satisfied requests matches the number of *unique* 
> objects to wait for.  The former is calculated using the number of requested 
> object IDs, not the number of unique requested object IDs.
> To reproduce:
>  
> {code:java}
> # start the plasma store first
> from pyarrow import plasma
> client = plasma.connect("/tmp/plasma")
> oid = client.put("hello, world")
> print(client.get([oid, oid], -1))
> print("done.")
> {code}
>  
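
The counting mismatch described above can be modelled in a few lines of plain Python. This is an illustration of the described logic only, not the Plasma source; `is_get_complete` and its `dedupe` flag are hypothetical names.

```python
def is_get_complete(requested_ids, available_ids, dedupe=False):
    """Mimic the completion check: satisfied count vs. unique IDs to wait for."""
    ids = set(requested_ids) if dedupe else list(requested_ids)
    # The target counts each object ID once.
    num_to_wait_for = len(set(requested_ids))
    # Without deduplication, a repeated ID is counted once per occurrence.
    num_satisfied = sum(1 for oid in ids if oid in available_ids)
    return num_satisfied == num_to_wait_for


hangs = not is_get_complete(["oid", "oid"], {"oid"})         # 2 != 1, never completes
fixed = is_get_complete(["oid", "oid"], {"oid"}, dedupe=True)
```

Deduplicating the requested IDs before counting is one way the two numbers could be made to agree; whether the server should do that or reject the request is a separate design question.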



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14115) [Python] Remove deprecated pyarrow.serialization functionality

2021-09-28 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421295#comment-17421295
 ] 

Joris Van den Bossche commented on ARROW-14115:
---

No, I don't think we ever officially / explicitly decided to deprecate Plasma 
(although there are regularly mailing list threads asking about plasma status 
confirming it is basically unmaintained)

> [Python] Remove deprecated pyarrow.serialization functionality
> --
>
> Key: ARROW-14115
> URL: https://issues.apache.org/jira/browse/ARROW-14115
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 6.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

