[jira] [Updated] (ARROW-15144) [Java] Unable to read IPC file in master
[ https://issues.apache.org/jira/browse/ARROW-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão updated ARROW-15144: - Description: I think that PR https://github.com/apache/arrow/pull/11709 may have caused a regression in reading IPC files. Attached is an arrow file that can't be read by the Java implementation, but it can be read by all other implementations. Its contents correspond exactly to the generated_primitive.json.gz used in integration tests. Background: The integration CI pipeline in Rust's arrow2 started failing after the PR mentioned above. The logs show that all but the Java implementation are able to consume the attached file (and more generally the files created by arrow2's implementation). The PR broke almost all tests, suggesting that it is not something specific to the file but a broader issue. Log: https://pipelines.actions.githubusercontent.com/RJ1isxNgLS0jQX3HKOGkLQjJSEMqOm4RfxnyKHS4o90jAsObvY/_apis/pipelines/1/runs/14655/signedlogcontent/2?urlExpires=2021-12-17T05%3A35%3A25.6055769Z&urlSigningMethod=HMACV1&urlSignature=Nx7nRNdrcUCbtvOnnXAYGDEuSEJUiDT%2BU2jNcqqp%2FEs%3D The logs also suggest that the Java implementation may be leaking memory when such an event happens. {code:java} 2021-12-16T05:38:33.1575113Z 05:38:33.055 [main] DEBUG org.apache.arrow.vector.ipc.message.ArrowRecordBatch - Buffer in RecordBatch at 5088, length: 2040 2021-12-16T05:38:33.1577399Z 05:38:33.076 [main] ERROR org.apache.arrow.memory.BaseAllocator - Memory was leaked by query. 
Memory leaked: (8928) 2021-12-16T05:38:33.1578667Z Allocator(ROOT) 0/8928/1771528/2147483647 (res/actual/peak/limit) 2021-12-16T05:38:33.1579193Z 2021-12-16T05:38:33.1579792Z Incompatible files 2021-12-16T05:38:33.1580427Z Different values in column: 2021-12-16T05:38:33.1595138Z bool_nonnullable: Bool not null at index 0: null != false 2021-12-16T05:38:33.1597137Z 05:38:33.078 [main] ERROR org.apache.arrow.tools.Integration - Incompatible files 2021-12-16T05:38:33.1598669Z java.lang.IllegalArgumentException: Different values in column: 2021-12-16T05:38:33.1599788Z bool_nonnullable: Bool not null at index 0: null != false 2021-12-16T05:38:33.1601330Zat org.apache.arrow.vector.util.Validator.compareFieldVectors(Validator.java:133) 2021-12-16T05:38:33.1603803Zat org.apache.arrow.vector.util.Validator.compareVectorSchemaRoot(Validator.java:107) 2021-12-16T05:38:33.1605836Zat org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:209) 2021-12-16T05:38:33.1607342Zat org.apache.arrow.tools.Integration.run(Integration.java:119) 2021-12-16T05:38:33.1608817Zat org.apache.arrow.tools.Integration.main(Integration.java:70) 2021-12-16T05:38:33.1610327ZSuppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (8928) 2021-12-16T05:38:33.1611471Z Allocator(ROOT) 0/8928/1771528/2147483647 (res/actual/peak/limit) 2021-12-16T05:38:33.1612372Z 2021-12-16T05:38:33.1613537Zat org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437) 2021-12-16T05:38:33.1615288Zat org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29) 2021-12-16T05:38:33.1616926Zat org.apache.arrow.tools.Integration$Command$3.$closeResource(Integration.java:228) 2021-12-16T05:38:33.1618454Zat org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:228) 2021-12-16T05:38:33.1619500Z... 
2 common frames omitted 2021-12-16T05:38:33.1619935Z 2021-12-16T05:38:33.1620598Z -- {code} I can't discard the possibility that this is an issue in arrow2 and an undefined issue in the implementation - I am raising it here because all other implementations can read the files. For reference, the offending field (second column, bool_nonnullable), contains the following values buffer: ``` validity buffer: [0, 0, 0, 0, 0, 0, 0, 0] values buffer: [0b1100, 0b1110010, 0, 0, 0, 0, 0, 0] ``` and the FieldNode has null_count = 0. I would expect this situation to yield an array without null values. was: I think that PR https://github.com/apache/arrow/pull/11709 may have caused a regression in reading IPC files. Attached is an arrow file that can't be read by the Java implementation, but it can be read by all other implementations. Its contents correspond exactly to the generated_primitive.json.gz used in integration tests. Background: The integration CI pipeline in Rust's arrow2 started failing after the PR mentioned above. The logs show that all but the Java implementation are able to consume the attached file (and more generally the files created by arrow2's implementation). The PR broke almost all tests, suggesting that it is not something specific to the file but a broader issue. Log: https://pipelines.actions.githubusercontent.com/RJ1isxNgLS0jQX3HKOGkLQjJSEMqOm4RfxnyKHS4o90jAsObvY/_apis/pipelines/1/r
[jira] [Created] (ARROW-15144) [Java] Unable to read IPC file in master
Jorge Leitão created ARROW-15144: Summary: [Java] Unable to read IPC file in master Key: ARROW-15144 URL: https://issues.apache.org/jira/browse/ARROW-15144 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Jorge Leitão Attachments: generated_primitive.arrow I think that PR https://github.com/apache/arrow/pull/11709 may have caused a regression in reading IPC files. Attached is an arrow file that can't be read by the Java implementation, but it can be read by all other implementations. Its contents correspond exactly to the generated_primitive.json.gz used in integration tests. Background: The integration CI pipeline in Rust's arrow2 started failing after the PR mentioned above. The logs show that all but the Java implementation are able to consume the attached file (and more generally the files created by arrow2's implementation). The PR broke almost all tests, suggesting that it is not something specific to the file but a broader issue. Log: https://pipelines.actions.githubusercontent.com/RJ1isxNgLS0jQX3HKOGkLQjJSEMqOm4RfxnyKHS4o90jAsObvY/_apis/pipelines/1/runs/14655/signedlogcontent/2?urlExpires=2021-12-17T05%3A35%3A25.6055769Z&urlSigningMethod=HMACV1&urlSignature=Nx7nRNdrcUCbtvOnnXAYGDEuSEJUiDT%2BU2jNcqqp%2FEs%3D The logs also suggest that the Java implementation may be leaking memory when such an event happens. 
{code:java} 2021-12-16T05:38:37.6833847Z 05:38:37.622 [main] ERROR org.apache.arrow.tools.Integration - Incompatible files 2021-12-16T05:38:37.6835533Z java.lang.IllegalArgumentException: Different values in column: 2021-12-16T05:38:37.6836731Z f11: Timestamp(SECOND, UTC) at index 0: null != -62135596800 2021-12-16T05:38:37.6838188Zat org.apache.arrow.vector.util.Validator.compareFieldVectors(Validator.java:133) 2021-12-16T05:38:37.6840563Zat org.apache.arrow.vector.util.Validator.compareVectorSchemaRoot(Validator.java:107) 2021-12-16T05:38:37.6842476Zat org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:209) 2021-12-16T05:38:37.6843841Zat org.apache.arrow.tools.Integration.run(Integration.java:119) 2021-12-16T05:38:37.6845214Zat org.apache.arrow.tools.Integration.main(Integration.java:70) 2021-12-16T05:38:37.6846597ZSuppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (894) 2021-12-16T05:38:37.6847623Z Allocator(ROOT) 0/894/442402/2147483647 (res/actual/peak/limit) 2021-12-16T05:38:37.6848029Z 2021-12-16T05:38:37.6848996Zat org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437) 2021-12-16T05:38:37.6851316Zat org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29) 2021-12-16T05:38:37.6882832Zat org.apache.arrow.tools.Integration$Command$3.$closeResource(Integration.java:228) 2021-12-16T05:38:37.6884294Zat org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:228) 2021-12-16T05:38:37.6885249Z... 2 common frames omitted {code} I can't discard the possibility that this is an issue in arrow2 and an undefined issue in the implementation - I am raising it here because all other implementations can read the files. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15134) [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType
[ https://issues.apache.org/jira/browse/ARROW-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-15134. -- Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11975 [https://github.com/apache/arrow/pull/11975] > [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType > - > > Key: ARROW-15134 > URL: https://issues.apache.org/jira/browse/ARROW-15134 > Project: Apache Arrow > Issue Type: Sub-task > Components: GLib >Reporter: Keisuke Okada >Assignee: Keisuke Okada >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15134) [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType
[ https://issues.apache.org/jira/browse/ARROW-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-15134: - Summary: [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType (was: [GLib] Add GArrow Month, Day, Nano IntervalDataType) > [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType > - > > Key: ARROW-15134 > URL: https://issues.apache.org/jira/browse/ARROW-15134 > Project: Apache Arrow > Issue Type: Sub-task > Components: GLib >Reporter: Keisuke Okada >Assignee: Keisuke Okada >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15143) Remove incorrect comment on API of Transform for StringBinaryTransformExecBase
Eduardo Ponce created ARROW-15143: - Summary: Remove incorrect comment on API of Transform for StringBinaryTransformExecBase Key: ARROW-15143 URL: https://issues.apache.org/jira/browse/ARROW-15143 Project: Apache Arrow Issue Type: Bug Reporter: Eduardo Ponce Assignee: Eduardo Ponce Fix For: 7.0.0 The [comment describing the API of Transform method for classes that inherit from StringBinaryTransformExecBase|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L626] incorrectly states there is a last parameter for returning error status. Instead the Transform method indicates an error by returning a negative value. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15140) [CI] move to v2 of checkouts for GHA
[ https://issues.apache.org/jira/browse/ARROW-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-15140. -- Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11983 [https://github.com/apache/arrow/pull/11983] > [CI] move to v2 of checkouts for GHA > > > Key: ARROW-15140 > URL: https://issues.apache.org/jira/browse/ARROW-15140 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > https://github.com/apache/arrow/pull/11974#discussion_r770942784 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-10154) [Python][Gandiva] Add Gandiva documentation to Python docs
[ https://issues.apache.org/jira/browse/ARROW-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-10154: -- Assignee: (was: Will Jones) > [Python][Gandiva] Add Gandiva documentation to Python docs > -- > > Key: ARROW-10154 > URL: https://issues.apache.org/jira/browse/ARROW-10154 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva, Python >Reporter: Will Jones >Priority: Minor > Labels: documentation, gandiva, pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > I am looking at using Gandiva's Python bindings in an experimental project, > and noticed there aren't any Sphinx docs. Assuming it's not too early, I > would be willing to add documentation in Sphinx. > Two starting points for documentation: > * A how-to for creating filters and projections using the builders. The > first part of this blog post gives the right idea: > [https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/] > * API reference. If possible, it would be cool to have a table listing all > of the available functions in the registry. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-7205) [C++][Gandiva] Implement regexp_matches, regexp_like functions
[ https://issues.apache.org/jira/browse/ARROW-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-7205: - Assignee: (was: Projjal Chanda) > [C++][Gandiva] Implement regexp_matches, regexp_like functions > -- > > Key: ARROW-7205 > URL: https://issues.apache.org/jira/browse/ARROW-7205 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Projjal Chanda >Priority: Minor > Labels: pull-request-available > Time Spent: 3h 10m > Remaining Estimate: 0h > > Support following signature in Gandiva: > FunctionSignature\{name =regexp_matches, return type =Bool, param types > =[Utf8, Utf8]} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables
[ https://issues.apache.org/jira/browse/ARROW-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461106#comment-17461106 ] Will Jones commented on ARROW-15135: There is no C++ (or Rust) implementation of Iceberg; the main implementation is in Java. My impression is that writers for these table formats are about 10x more complex than the readers (but I say that having implemented a reader and not a writer :)). I think it's reasonable to start by creating the reader + dataset logic within Arrow, though once we get to the writer I do have some questions as to whether there should just be a C++ implementation managed by the Iceberg project. > [C++][R][Python] Support reading from Apache Iceberg tables > --- > > Key: ARROW-15135 > URL: https://issues.apache.org/jira/browse/ARROW-15135 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Will Jones >Priority: Major > > This is an umbrella issue for supporting the [Apache Iceberg table > format|https://iceberg.apache.org/]. > Dremio has a good overview of the format here: > https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/ -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15142) Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow
Karthik created ARROW-15142: --- Summary: Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow Key: ARROW-15142 URL: https://issues.apache.org/jira/browse/ARROW-15142 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 6.0.1 Reporter: Karthik When trying to save a Pandas dataframe with a nested type (list within list, list within dict) using pyarrow engine, the following error is encountered {color:#e75c58}ArrowInvalid{color}: ('cannot mix list and non-list, non-null values', 'Conversion failed for column A with type object') Repro: {code:java} import pandas as pd x = pd.DataFrame({"A": [[24, 27, [1, 1]]]}) x.to_parquet('/tmp/a.pqt', engine="pyarrow") {code} Doing a bit of googling, it appears that this is a known Arrow shortcoming. However, this is a commonly encountered datastructure, and 'fastparquet' handles this seamlessly. Is there a proposed timeline/plan for fixing this? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461080#comment-17461080 ] F. H. commented on ARROW-15141: --- Thanks for the context [~apitrou] , we will try your suggestion :) > [C++] Fatal error condition occurred in aws_thread_launch > - > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > {code} > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core 
dumped) > {code} > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables
[ https://issues.apache.org/jira/browse/ARROW-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461078#comment-17461078 ] Will Jones commented on ARROW-15135: I agree Weston, it should just be a dataset factory. The metadata just determines the list of files. I'm not sure this is true of Iceberg, but in my experience with Delta Lake (which has a similar design) the metadata itself can become quite large. So we may want to block this work on an Avro -> Arrow reader (ARROW-1209). > [C++][R][Python] Support reading from Apache Iceberg tables > --- > > Key: ARROW-15135 > URL: https://issues.apache.org/jira/browse/ARROW-15135 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Will Jones >Priority: Major > > This is an umbrella issue for supporting the [Apache Iceberg table > format|https://iceberg.apache.org/]. > Dremio has a good overview of the format here: > https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/ -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461075#comment-17461075 ] Antoine Pitrou commented on ARROW-15141: Since you are using conda/mamba, a workaround should be to switch to an older version of aws-sdk-cpp such as {{aws-sdk-cpp=1.8.186}}. > [C++] Fatal error condition occurred in aws_thread_launch > - > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > {code} > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core 
dumped) > {code} > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461073#comment-17461073 ] Antoine Pitrou commented on ARROW-15141: Thanks for the report. This is very likely this issue: https://github.com/aws/aws-sdk-cpp/issues/1809 > [C++] Fatal error condition occurred in aws_thread_launch > - > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > {code} > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core 
dumped) > {code} > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15141: --- Description: Hi, I am getting randomly the following error when first running inference with a Tensorflow model and then writing the result to a `.parquet` file: {code} Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS Exiting Application Stack trace: /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x7ffb14235f19] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x7ffb14227098] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x7ffb1406ea43] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x7ffb1406c35a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x7ffb142a2f5a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x7ffb147fd570] /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] 
/home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) [0x562576609a51] /bin/bash: line 1: 2341494 Aborted (core dumped) {code} My colleague ran into the same issue on Centos 8 while running the same job + same environment on SLURM, so I guess it could be some issue with tensorflow + pyarrow. Also I found a github issue with multiple people running into the same issue: [https://github.com/huggingface/datasets/issues/3310] It would be very important to my lab that this bug gets resolved, as we cannot work with parquet any more. Unfortunately, we do not have the knowledge to fix it. was: Hi, I am getting randomly the following error when first running inference with a Tensorflow model and then writing the result to a `.parquet` file: ``` Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS Exiting Application Stack trace: /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x7ffb14235f19] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x7ffb14227098] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x7ffb1406ea43] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x7ffb1406c35a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] 
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x7ffb142a2f5a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x7ffb147fd570] /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) [0x562576609a51] /bin/bash: line 1: 2341494 Aborted (core dumped) ``` My colleague ran into the same issue on Centos 8 while running the same job + same environment on SLURM, so I guess it could be some issue with tensorflow + pyarrow. Also I found a github issue with multiple people running into the same issue: [https://github.com/huggingface/datasets/issues/3310] It would be very important to my lab that this bug gets resolved, as we cannot work with parquet any more. Unfortunately, we do not have the knowledge to fix it.
[jira] [Updated] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15141: --- Summary: [C++] Fatal error condition occurred in aws_thread_launch (was: Fatal error condition occurred in aws_thread_launch) > [C++] Fatal error condition occurred in aws_thread_launch > - > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > ``` > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core dumped) > 
``` > > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15141) Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15141: --- Language: (was: Python) > Fatal error condition occurred in aws_thread_launch > --- > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > ``` > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core dumped) > 
``` > > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15141) Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15141: --- Component/s: C++ > Fatal error condition occurred in aws_thread_launch > --- > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > ``` > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core dumped) > 
``` > > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15141) Fatal error condition occurred in aws_thread_launch
F. H. created ARROW-15141: - Summary: Fatal error condition occurred in aws_thread_launch Key: ARROW-15141 URL: https://issues.apache.org/jira/browse/ARROW-15141 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 6.0.1, 6.0.0 Environment: - `uname -a`: Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` pyarrow 6.0.0 py39hff6fa39_1_cpu conda-forge python 3.9.7 hb7a2778_3_cpython conda-forge python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge python-flatbuffers 1.12 pyhd8ed1ab_1 conda-forge python-irodsclient 1.0.0 pyhd8ed1ab_0 conda-forge python-rocksdb 0.7.0 py39h7fcd5f3_4 conda-forge python_abi 3.9 2_cp39 conda-forge tensorflow 2.6.2 cuda112py39h9333c2f_0 conda-forge tensorflow-base 2.6.2 cuda112py39h7de589b_0 conda-forge tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0 conda-forge tensorflow-gpu 2.6.2 cuda112py39h0bbbad9_0 conda-forge Reporter: F. H. Hi, I am getting randomly the following error when first running inference with a Tensorflow model and then writing the result to a `.parquet` file: ``` Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS Exiting Application Stack trace: /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x7ffb14235f19] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x7ffb14227098] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x7ffb1406ea43] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] 
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x7ffb1406c35a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x7ffb142a2f5a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x7ffb147fd570] /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) [0x562576609a51] /bin/bash: line 1: 2341494 Aborted (core dumped) ``` My colleague ran into the same issue on Centos 8 while running the same job + same environment on SLURM, so I guess it could be some issue with tensorflow + pyarrow. Also I found a github issue with multiple people running into the same issue: [https://github.com/huggingface/datasets/issues/3310] It would be very important to my lab that this bug gets resolved, as we cannot work with parquet any more. Unfortunately, we do not have the knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15140) [CI] move to v2 of checkouts for GHA
[ https://issues.apache.org/jira/browse/ARROW-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15140: --- Labels: pull-request-available (was: ) > [CI] move to v2 of checkouts for GHA > > > Key: ARROW-15140 > URL: https://issues.apache.org/jira/browse/ARROW-15140 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/apache/arrow/pull/11974#discussion_r770942784 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15140) [CI] move to v2 of checkouts for GHA
Jonathan Keane created ARROW-15140: -- Summary: [CI] move to v2 of checkouts for GHA Key: ARROW-15140 URL: https://issues.apache.org/jira/browse/ARROW-15140 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Jonathan Keane Assignee: Jonathan Keane https://github.com/apache/arrow/pull/11974#discussion_r770942784 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce closed ARROW-14436. - Resolution: Cannot Reproduce > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/arrow/issues/11279] > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461036#comment-17461036 ] Eduardo Ponce edited comment on ARROW-14436 at 12/16/21, 8:32 PM: -- Not sure what was causing the issue here. I was not able to reproduce such error, but based on the SO discussion above (btw not related to Apache Arrow), others have stumbled upon similar issues. I am closing this issue as [the reporter was able to prevent error by updating software|https://github.com/apache/arrow/pull/11518#issuecomment-949968012]. was (Author: edponce): Not sure what was causing the issue here. I was not able to reproduce such error, but based on the SO discussion above (BTW not related to Apache Arrow), others have stumbled upon similar issues. I am closing this issue as [the reporter was able to prevent error by updating software|https://github.com/apache/arrow/pull/11518#issuecomment-949968012]. > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/arrow/issues/11279] > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. 
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461036#comment-17461036 ] Eduardo Ponce commented on ARROW-14436: --- Not sure what was causing the issue here. I was not able to reproduce such error, but based on the SO discussion above (BTW not related to Apache Arrow), others have stumbled upon similar issues. I am closing this issue as [the reporter was able to prevent error by updating software|https://github.com/apache/arrow/pull/11518#issuecomment-949968012]. > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/arrow/issues/11279] > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce updated ARROW-14436: -- Description: Copied from [https://github.com/apache/arrow/issues/11279] Steps to reproduce: Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 CCACHE_LOGFILE=./ccache.log Find in ./ccache.log: Failed; falling back to running the real compiler Result: unsupported compiler option Dropping -fcolor-diagnostics fixes the issue. I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also relevant for: -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. was: Copied from https://github.com/apache/arrow/issues/11279 Steps to reproduce: Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 CCACHE_LOGFILE=./ccache.log Find in ./ccache.log: Failed; falling back to running the real compiler Result: unsupported compiler option Dropping -fcolor-diagnostics fixes the issue. I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also relevant for: -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. I can open a PR if you tell me which way you prefer. 
> [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/arrow/issues/11279] > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14853) [C++][Python] Cryptic error message when required compute options missing
[ https://issues.apache.org/jira/browse/ARROW-14853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-14853. Resolution: Fixed Issue resolved by pull request 11961 [https://github.com/apache/arrow/pull/11961] > [C++][Python] Cryptic error message when required compute options missing > - > > Key: ARROW-14853 > URL: https://issues.apache.org/jira/browse/ARROW-14853 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: Kernels, pull-request-available > Fix For: 7.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This will confuse users: > {code:python} > >>> pc.partition_nth_indices([1,2,3]) > Traceback (most recent call last): > File "", line 1, in > pc.partition_nth_indices([1,2,3]) > File "/home/antoine/arrow/dev/python/pyarrow/compute.py", line 202, in > wrapper > return func.call(args, options, memory_pool) > File "pyarrow/_compute.pyx", line 327, in pyarrow._compute.Function.call > result = GetResultValue( > File "pyarrow/error.pxi", line 143, in > pyarrow.lib.pyarrow_internal_check_status > return check_status(status) > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > raise ArrowInvalid(message) > ArrowInvalid: Attempted to initialize KernelState from null FunctionOptions > /home/antoine/arrow/dev/cpp/src/arrow/compute/function.cc:222 > kernel->init(&kernel_ctx, {kernel, inputs, options}) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
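[Editor's illustration] The ARROW-14853 traceback above fails deep inside kernel initialization ("Attempted to initialize KernelState from null FunctionOptions") instead of telling the user which options object is missing. The sketch below is hypothetical plain Python (not Arrow's actual C++/Cython implementation, and `call_function`/`PartitionNthOptions` here are stand-ins): it shows the general pattern of validating required options up front and naming the missing options class in the error.

```python
# Hypothetical sketch of the "required options" pattern; names are stand-ins,
# not pyarrow internals.

class PartitionNthOptions:
    """Stand-in for an options class a kernel requires (e.g. a pivot index)."""
    def __init__(self, pivot):
        self.pivot = pivot

def call_function(name, args, options=None, *, required_options=None):
    # Validate before kernel init so the error names the missing options
    # class, rather than failing later with a cryptic null-state message.
    if required_options is not None and options is None:
        raise ValueError(
            f"Function '{name}' cannot be called without options. "
            f"Please provide an instance of {required_options.__name__}."
        )
    # Stand-in for actually dispatching to a kernel:
    return f"called {name} with pivot={options.pivot}"

try:
    call_function("partition_nth_indices", [1, 2, 3],
                  required_options=PartitionNthOptions)
except ValueError as e:
    print(e)  # names PartitionNthOptions instead of a KernelState failure

print(call_function("partition_nth_indices", [1, 2, 3],
                    PartitionNthOptions(pivot=1),
                    required_options=PartitionNthOptions))
```

The point of the design is that the validation happens at the function-call boundary, where the user-facing name of the options class is still known.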
[jira] [Commented] (ARROW-11441) [R] Read CSV from character vector
[ https://issues.apache.org/jira/browse/ARROW-11441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461011#comment-17461011 ] Neal Richardson commented on ARROW-11441: - Weston's suggestion works: {code} > x <- c("a,b", "1,2", "3,4") > b <- buffer(charToRaw(paste(x, collapse = "\n"))) > read_csv_arrow(b) a b 1 1 2 2 3 4 {code} read_csv_arrow already can read an arrow::Buffer, we just have to put the character vector into a Buffer. There's surely a more efficient way to do that, but this would work. > [R] Read CSV from character vector > -- > > Key: ARROW-11441 > URL: https://issues.apache.org/jira/browse/ARROW-11441 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Minor > Fix For: 8.0.0 > > > `readr::read_csv()` lets you read in data from a character vector, useful for > (e.g.) taking the results of a system call and reading it in as a data.frame. > {code} > > readr::read_csv(c("a,b", "1,2", "3,4")) > # A tibble: 2 x 2 > a b > > 1 1 2 > 2 3 4 > {code} > One solution would be similar to ARROW-9235, perhaps, treating it as a > textConnection. > Another solution is to write to a tempfile. -- This message was sent by Atlassian Jira (v8.20.1#820001)
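[Editor's illustration] The workaround in the comment above amounts to joining the character vector into one string, wrapping it in an in-memory buffer, and handing that to a reader that expects a file. The same idea sketched in Python using only the standard library (the pyarrow analog would presumably pair a buffer with `pyarrow.csv.read_csv`, which is not shown here to keep the sketch dependency-free):

```python
import csv
import io

# Same pattern as the R workaround: join the "character vector" of lines
# into one string, wrap it in an in-memory file-like buffer, and pass it
# to a CSV reader that expects a file object.
lines = ["a,b", "1,2", "3,4"]
buf = io.StringIO("\n".join(lines))

rows = list(csv.DictReader(buf))
print(rows)  # [{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}]
```

As the comment notes, a tempfile would also work, but the buffer approach avoids touching the filesystem.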
[jira] [Assigned] (ARROW-15133) [CI] Remove util_checkout.sh and util_cleanup.sh scripts
[ https://issues.apache.org/jira/browse/ARROW-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15133: -- Assignee: Jonathan Keane > [CI] Remove util_checkout.sh and util_cleanup.sh scripts > > > Key: ARROW-15133 > URL: https://issues.apache.org/jira/browse/ARROW-15133 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > - ci/scripts/util_checkout.sh was used to checkout submodules because > actions/checkout@v2 has removed support for that, but they have restored it > since. > - ci/scripts/util_cleanup.sh was used to free up disk space on github actions > runners, because at that time it was limited to 7GB, from a recent run it > looks like the linux runners now have 32GB free space so we can try to > disable the cleanup step sparing almost a minute of build time -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15133) [CI] Remove util_checkout.sh and util_cleanup.sh scripts
[ https://issues.apache.org/jira/browse/ARROW-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15133: -- Assignee: Krisztian Szucs (was: Jonathan Keane) > [CI] Remove util_checkout.sh and util_cleanup.sh scripts > > > Key: ARROW-15133 > URL: https://issues.apache.org/jira/browse/ARROW-15133 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > - ci/scripts/util_checkout.sh was used to checkout submodules because > actions/checkout@v2 has removed support for that, but they have restored it > since. > - ci/scripts/util_cleanup.sh was used to free up disk space on github actions > runners, because at that time it was limited to 7GB, from a recent run it > looks like the linux runners now have 32GB free space so we can try to > disable the cleanup step sparing almost a minute of build time -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15133) [CI] Remove util_checkout.sh and util_cleanup.sh scripts
[ https://issues.apache.org/jira/browse/ARROW-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15133. Resolution: Fixed Issue resolved by pull request 11974 [https://github.com/apache/arrow/pull/11974] > [CI] Remove util_checkout.sh and util_cleanup.sh scripts > > > Key: ARROW-15133 > URL: https://issues.apache.org/jira/browse/ARROW-15133 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > - ci/scripts/util_checkout.sh was used to checkout submodules because > actions/checkout@v2 has removed support for that, but they have restored it > since. > - ci/scripts/util_cleanup.sh was used to free up disk space on github actions > runners, because at that time it was limited to 7GB, from a recent run it > looks like the linux runners now have 32GB free space so we can try to > disable the cleanup step sparing almost a minute of build time -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13554) [C++] Remove deprecated Scanner::Scan
[ https://issues.apache.org/jira/browse/ARROW-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460997#comment-17460997 ] Weston Pace commented on ARROW-13554: - Yes, I think this and ARROW-13338 will be really good to get into 7.0.0. > [C++] Remove deprecated Scanner::Scan > - > > Key: ARROW-13554 > URL: https://issues.apache.org/jira/browse/ARROW-13554 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Weston Pace >Priority: Minor > Fix For: 7.0.0 > > > It was already slated for removal in 5.0.0. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-14444) [C++] Implement task-based model into the executable-pipelines.
[ https://issues.apache.org/jira/browse/ARROW-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace closed ARROW-14444. --- Resolution: Duplicate > [C++] Implement task-based model into the executable-pipelines. > --- > > Key: ARROW-14444 > URL: https://issues.apache.org/jira/browse/ARROW-14444 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > The plan fragments will use a task-based model for parallel processing. > That means that there is a scheduler that gets assigned tasks for execution, > keeps selecting the next task for execution, and assigning it to a specific > executor (aka arrow thread pool). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14444) [C++] Implement task-based model into the executable-pipelines.
[ https://issues.apache.org/jira/browse/ARROW-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460992#comment-17460992 ] Weston Pace commented on ARROW-14444: - I think this is superseded by ARROW-15079 which has more description > [C++] Implement task-based model into the executable-pipelines. > --- > > Key: ARROW-14444 > URL: https://issues.apache.org/jira/browse/ARROW-14444 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > The plan fragments will use a task-based model for parallel processing. > That means that there is a scheduler that gets assigned tasks for execution, > keeps selecting the next task for execution, and assigning it to a specific > executor (aka arrow thread pool). -- This message was sent by Atlassian Jira (v8.20.1#820001)
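The task-based model the ticket describes — a scheduler that keeps selecting the next task and hands it to an executor pool — can be sketched in a few lines of Python. This is a toy illustration only, not Arrow's actual C++ implementation; all names here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def run_tasks(tasks, num_workers=4):
    """Toy scheduler: drain a queue of tasks, assigning each to a worker
    in the pool (the pool plays the role of the 'arrow thread pool')."""
    pending = Queue()
    for task in tasks:
        pending.put(task)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = []
        while not pending.empty():
            task = pending.get()               # scheduler selects the next task
            futures.append(pool.submit(task))  # assign it to an executor
        return [f.result() for f in futures]
```

Results come back in submission order because the futures list mirrors the task order, even though the tasks themselves may run on any worker.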
[jira] [Assigned] (ARROW-14289) [C++] Change Scanner::Head to return a RecordBatchReader
[ https://issues.apache.org/jira/browse/ARROW-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace reassigned ARROW-14289: --- Assignee: Weston Pace > [C++] Change Scanner::Head to return a RecordBatchReader > > > Key: ARROW-14289 > URL: https://issues.apache.org/jira/browse/ARROW-14289 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Neal Richardson >Assignee: Weston Pace >Priority: Major > Fix For: 7.0.0 > > > Following ARROW-9731 and ARROW-13893. This would make it more natural to work > with ExecPlans that return a RecordBatchReader when you Run them. > Alternatively, we could move the business to RecordBatchReader::Head. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14705) [C++] unify_schemas can't handle int64 + double, affects CSV dataset
[ https://issues.apache.org/jira/browse/ARROW-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-14705: Assignee: David Li > [C++] unify_schemas can't handle int64 + double, affects CSV dataset > > > Key: ARROW-14705 > URL: https://issues.apache.org/jira/browse/ARROW-14705 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Reporter: Neal Richardson >Assignee: David Li >Priority: Major > Labels: query-engine > > Twitter question of "how can I make arrow's csv reader not make int64 for > integers", turns out to be originating from the scenario where some csvs in a > directory may have all integer values for a column but there are decimals in > others, and you can't use them together in a dataset. > {code:r} > library(arrow, warn.conflicts = FALSE) > library(dplyr, warn.conflicts = FALSE) > ds_dir <- tempfile() > dir.create(ds_dir) > cat("a\n1", file = file.path(ds_dir, "1.csv")) > cat("a\n1.1", file = file.path(ds_dir, "2.csv")) > ds <- open_dataset(ds_dir, format = "csv") > ds > #> FileSystemDataset with 2 csv files > #> a: int64 > ## It just picked the schema of the first file > collect(ds) > #> Error: Invalid: Could not open CSV input source > '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx0gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': > Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid > value '1.1' > #> ../src/arrow/csv/converter.cc:492 decoder_.Decode(data, size, quoted, > &value) > #> ../src/arrow/csv/parser.h:123 status > #> ../src/arrow/csv/converter.cc:496 parser.VisitColumn(col_index, visit) > #> ../src/arrow/csv/reader.cc:462 > internal::UnwrapOrRaise(maybe_decoded_arrays) > #> ../src/arrow/compute/exec/exec_plan.cc:398 iterator_.Next() > #> ../src/arrow/record_batch.cc:318 ReadNext(&batch) > #> ../src/arrow/record_batch.cc:329 ReadAll(&batches) > ## Let's try again and tell it to unify schemas. 
Should result in a float64 > type > ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE) > #> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs > double > #> ../src/arrow/type.cc:1621 fields_[i]->MergeWith(field) > #> ../src/arrow/type.cc:1684 AddField(field) > #> ../src/arrow/type.cc:1755 builder.AddSchema(schema) > #> ../src/arrow/dataset/discovery.cc:251 Inspect(options.inspect_options) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
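The merge failure above boils down to a missing numeric promotion rule: unifying int64 with double should widen to float64 rather than error. A minimal sketch of the desired promotion logic (pure Python with hypothetical names — not the actual C++ `Field::MergeWith` implementation):

```python
# Toy widening order: merging any two of these yields the wider type.
_WIDENING = ["int32", "int64", "float64"]

def merge_types(a, b):
    """Merge two column types, promoting to the wider numeric type."""
    if a == b:
        return a
    if a in _WIDENING and b in _WIDENING:
        return _WIDENING[max(_WIDENING.index(a), _WIDENING.index(b))]
    raise ValueError(f"incompatible types: {a} vs {b}")

def unify_schemas(schemas):
    """Merge a list of {column: type} dicts field by field."""
    merged = {}
    for schema in schemas:
        for col, typ in schema.items():
            merged[col] = merge_types(merged[col], typ) if col in merged else typ
    return merged
```

With such a rule, the two CSVs in the reproduction (`a: int64` and `a: double`) would unify to a float64 column instead of raising.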
[jira] [Commented] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables
[ https://issues.apache.org/jira/browse/ARROW-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460970#comment-17460970 ] Weston Pace commented on ARROW-15135: - So off the top of my head I think this integration might take the form of a dataset factory: The dataset factory would, given an iceberg table, consult the iceberg metadata. From that metadata we can get: - The list of files - The format of the files (Parquet vs Orc) - The partitioning scheme - Potentially the filesystem? We could then take those three things and create an ordinary FileSystemDataset. Alternatively, we could create an IcebergDataset and IcebergFragment but I'm not sure there would be anything to gain by doing so. > [C++][R][Python] Support reading from Apache Iceberg tables > --- > > Key: ARROW-15135 > URL: https://issues.apache.org/jira/browse/ARROW-15135 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Will Jones >Priority: Major > > This is an umbrella issue for supporting the [Apache Iceberg table > format|https://iceberg.apache.org/]. > Dremio has a good overview of the format here: > https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/ -- This message was sent by Atlassian Jira (v8.20.1#820001)
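The factory shape the comment proposes — consult the Iceberg metadata, extract the file list, format, and partitioning, then hand those to an ordinary FileSystemDataset — could look roughly like this. It is a pure-Python pseudostructure over plain dicts; every field name below is a hypothetical stand-in for whatever the real Iceberg metadata exposes:

```python
def dataset_from_iceberg(table_metadata):
    """Hypothetical factory: reduce Iceberg table metadata to the pieces
    an ordinary FileSystemDataset needs (files, format, partitioning)."""
    files = [entry["path"] for entry in table_metadata["manifest"]]
    file_format = table_metadata["format"]  # e.g. "parquet" or "orc"
    partitioning = table_metadata.get("partition_spec", [])
    return {
        "files": files,
        "format": file_format,
        "partitioning": partitioning,
    }
```

Nothing Iceberg-specific survives past this point, which is the argument against a dedicated IcebergDataset/IcebergFragment: once the metadata is resolved, the result is just a file-system dataset.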
[jira] [Resolved] (ARROW-10209) [Python] support positional arguments for options in compute wrapper
[ https://issues.apache.org/jira/browse/ARROW-10209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-10209. Resolution: Fixed Issue resolved by pull request 11955 [https://github.com/apache/arrow/pull/11955] > [Python] support positional arguments for options in compute wrapper > > > Key: ARROW-10209 > URL: https://issues.apache.org/jira/browse/ARROW-10209 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Maarten Breddels >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > As mentioned here: > [https://github.com/apache/arrow/pull/8271#discussion_r500897047] > we cannot support > {code:java} > pc.split_pattern(arr, "---") > {code} > where the second argument is a positional argument of the FunctionObject > class. > I think it makes sense for a small subset (like this function) to support > non-keyword arguments. -- This message was sent by Atlassian Jira (v8.20.1#820001)
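The dispatch that makes `pc.split_pattern(arr, "---")` work can be sketched as: accept either an options object or the option's value positionally, constructing the options object when given a bare value. This is an illustration of the idea only, not pyarrow's actual wrapper code, and the split implementation is a plain-Python stand-in:

```python
class SplitPatternOptions:
    """Toy options class standing in for a compute FunctionOptions."""
    def __init__(self, pattern, max_splits=None):
        self.pattern = pattern
        self.max_splits = max_splits

def split_pattern(strings, pattern_or_options, **kwargs):
    """Accept an options object, or build one from a positional value."""
    if isinstance(pattern_or_options, SplitPatternOptions):
        options = pattern_or_options
    else:
        options = SplitPatternOptions(pattern_or_options, **kwargs)
    return [s.split(options.pattern) for s in strings]
```

Both call styles then behave identically: `split_pattern(arr, "---")` and `split_pattern(arr, SplitPatternOptions("---"))`.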
[jira] [Resolved] (ARROW-15044) [C++] Add simple stdout/JSON exporter for OpenTelemetry
[ https://issues.apache.org/jira/browse/ARROW-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-15044. Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11925 [https://github.com/apache/arrow/pull/11925] > [C++] Add simple stdout/JSON exporter for OpenTelemetry > --- > > Key: ARROW-15044 > URL: https://issues.apache.org/jira/browse/ARROW-15044 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available, query-engine > Fix For: 7.0.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > While it is odd for a library to configure OpenTelemetry exporters, there is > a reason for this: the actual end-user is unable to configure it directly (as > we use the C++ implementation, and the user is likely in Python or R), and > often the "important" bits are all in C++, so it would be nice to have a way > to dump the information collected. > On top of that, it would be nice to dump the data in an easily parseable > format for further consumption. > We should enable the ostream exporter, as well as a custom exporter that > dumps JSON. See > [https://github.com/open-telemetry/opentelemetry-cpp/pull/] for an > example. See discussion on [https://github.com/apache/arrow/pull/11906] as > well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-12526) [Python] Pre-generate pyarrow.compute members
[ https://issues.apache.org/jira/browse/ARROW-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460958#comment-17460958 ] Antoine Pitrou commented on ARROW-12526: [~alippai] Can you show an example invocation of pylint or another such utility? > [Python] Pre-generate pyarrow.compute members > -- > > Key: ARROW-12526 > URL: https://issues.apache.org/jira/browse/ARROW-12526 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 4.0.0 >Reporter: Adam Lippai >Priority: Minor > Fix For: 7.0.0 > > > Static analysis tools (e.g. pylint) don't recognize simple members like > pyarrow.compute.equal, they report it as _missing_. Generating the file (well, a file imported by this file, I assume) > [https://github.com/apache/arrow/blob/master/python/pyarrow/compute.py] > instead of runtime wrapping of the functions would improve the developer > experience. -- This message was sent by Atlassian Jira (v8.20.1#820001)
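The pre-generation idea — emit a static module from the runtime function registry so that static analyzers see real `def` statements instead of dynamically attached members — might look like the sketch below. The `_call` dispatcher and the way function names are obtained are assumptions; in reality the names would come from pyarrow's compute function registry:

```python
def generate_stub(function_names):
    """Emit source for a static module exposing one def per registered
    compute function, each delegating to a hypothetical _call dispatcher."""
    lines = ["# Auto-generated; do not edit.\n"]
    for name in sorted(function_names):
        lines.append(f"def {name}(*args, **kwargs):\n")
        lines.append(f"    return _call('{name}', *args, **kwargs)\n")
    return "".join(lines)
```

Because the generated file contains ordinary function definitions, tools like pylint no longer flag `pyarrow.compute.equal` as a missing member.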
[jira] [Updated] (ARROW-8998) [Python] Make NumPy an optional runtime dependency
[ https://issues.apache.org/jira/browse/ARROW-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8998: -- Issue Type: Improvement (was: New Feature) > [Python] Make NumPy an optional runtime dependency > -- > > Key: ARROW-8998 > URL: https://issues.apache.org/jira/browse/ARROW-8998 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > > Since in the relatively near future, one will be able to do non-trivial > analytical operations and query processing natively on Arrow data structures > through pyarrow, it does not make sense to require users to always install > NumPy when they install pyarrow. I propose to split the NumPy-depending parts > of libarrow_python into a libarrow_numpy (which also must be bundled) and > moving this part of the codebase into a separate Cython module. > This refactoring should be relatively painless though there may be a number > of packaging details to chase up since this would introduce a new shared > library to be installed in various packaging targets. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-8998) [Python] Make NumPy an optional runtime dependency
[ https://issues.apache.org/jira/browse/ARROW-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8998: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [Python] Make NumPy an optional runtime dependency > -- > > Key: ARROW-8998 > URL: https://issues.apache.org/jira/browse/ARROW-8998 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > > Since in the relatively near future, one will be able to do non-trivial > analytical operations and query processing natively on Arrow data structures > through pyarrow, it does not make sense to require users to always install > NumPy when they install pyarrow. I propose to split the NumPy-depending parts > of libarrow_python into a libarrow_numpy (which also must be bundled) and > moving this part of the codebase into a separate Cython module. > This refactoring should be relatively painless though there may be a number > of packaging details to chase up since this would introduce a new shared > library to be installed in various packaging targets. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-7174) [Python] Expose parquet dictionary size write parameter in python.
[ https://issues.apache.org/jira/browse/ARROW-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460957#comment-17460957 ] Antoine Pitrou commented on ARROW-7174: --- [~jorisvandenbossche] > [Python] Expose parquet dictionary size write parameter in python. > -- > > Key: ARROW-7174 > URL: https://issues.apache.org/jira/browse/ARROW-7174 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Micah Kornfield >Priority: Major > Labels: good-second-issue > Fix For: 7.0.0 > > > In some cases it might be useful to have dictionaries larger then the current > default 1MB. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-13598) [C++] Deprecate Datum::COLLECTION
[ https://issues.apache.org/jira/browse/ARROW-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-13598. -- Resolution: Fixed Issue resolved by pull request 11979 [https://github.com/apache/arrow/pull/11979] > [C++] Deprecate Datum::COLLECTION > - > > Key: ARROW-13598 > URL: https://issues.apache.org/jira/browse/ARROW-13598 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > It looks like "collection" datums are not used anywhere. Where we want to > return several pieces of data, we generally return a Struct array or scalar > wrapping them. > Perhaps we should simply deprecate or even remove them. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-2075) [Python] Add documentation section for integrations with PyTorch, TensorFlow
[ https://issues.apache.org/jira/browse/ARROW-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-2075: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [Python] Add documentation section for integrations with PyTorch, TensorFlow > > > Key: ARROW-2075 > URL: https://issues.apache.org/jira/browse/ARROW-2075 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Python >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > > This should also include any advice about issues with TensorFlow such as > packaging issues. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15139) [Python] write_dataset's file_write_options are too confusing and/or undocumented
Weston Pace created ARROW-15139: --- Summary: [Python] write_dataset's file_write_options are too confusing and/or undocumented Key: ARROW-15139 URL: https://issues.apache.org/jira/browse/ARROW-15139 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Weston Pace ParquetFileWriteOptions / IpcFileWriteOptions / CsvFileWriteOptions are completely undocumented. Also, creating them via the format is a little peculiar / confusing and I'm not sure why it is required. It should be clear to the user what options they can set and how to set them. -- This message was sent by Atlassian Jira (v8.20.1#820001)
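The pattern the ticket finds confusing — write options created via the format object rather than constructed directly — works roughly like the sketch below. The `make_write_options` name mirrors the real pyarrow dataset API; the class body and the specific option names are illustrative assumptions:

```python
class ParquetFormat:
    """Toy stand-in for a file format that knows its own write options,
    which is one rationale for routing option creation through the format."""
    default_options = {"compression": "snappy", "version": "2.6"}

    def make_write_options(self, **overrides):
        unknown = set(overrides) - set(self.default_options)
        if unknown:
            raise TypeError(f"unknown write options: {sorted(unknown)}")
        return {**self.default_options, **overrides}
```

Routing creation through the format lets each format validate its own keys and supply its own defaults — but, as the ticket notes, none of that is visible to the user without documentation.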
[jira] [Commented] (ARROW-15060) [R] open_dataset() on csv files lacks support for compressed files
[ https://issues.apache.org/jira/browse/ARROW-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460947#comment-17460947 ] Carl Boettiger commented on ARROW-15060: My apologies, but it looks like it is read_csv_arrow over S3 which fails: {code:java} ## arrow network csv.gz bench::bench_time({ # ERROR analysis <- arrow::s3_bucket(bucket = "analysis", endpoint_override = "data.ecoforecast.org", anonymous = TRUE) ds <- arrow::read_csv_arrow(analysis$OpenInputFile("combined_forecasts_scores.csv.gz")) }) {code} The above example is on a public bucket, should be reproducible. No problems if I use `read_csv_arrow()` locally. Relatedly, does `write_csv_arrow()` gzip a csv if we add the ".gz" to the name? Or is there a way to toggle compression when using `write_dataset` with `format="csv"` ? > [R] open_dataset() on csv files lacks support for compressed files > -- > > Key: ARROW-15060 > URL: https://issues.apache.org/jira/browse/ARROW-15060 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Carl Boettiger >Priority: Major > > Using open_dataset() on S3 buckets of csv files is a game-changing magic, > particularly with all the additional support for database / dplyr operations > over the remote connection, and the widespread adoption of S3 buckets even by > old-school big data providers like NOAA. > > It's not uncommon to encounter buckets with *.csv.gz formats. I know > technically this should be unnecessary, as compression can be done "in > flight" by the server, but usually this is not an issue for R users since R's > `connection` class automatically detects and gunzips compressed files (over > either POSIX or HTTP connections). It would be really great if arrow could > handle this case too. -- This message was sent by Atlassian Jira (v8.20.1#820001)
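The behaviour the report asks for — treating `*.csv.gz` transparently — usually starts with inferring the codec from the file extension, which is how R's `connection` class decides to gunzip. A minimal sketch of that detection step (the extension map is an assumption, not arrow's actual logic):

```python
def detect_compression(filename):
    """Infer a compression codec from the file extension, or None."""
    suffix_map = {".gz": "gzip", ".zst": "zstd", ".bz2": "bz2"}
    for suffix, codec in suffix_map.items():
        if filename.endswith(suffix):
            return codec
    return None
```

A reader front-end could then wrap the raw input stream in the matching decompressor before handing it to the CSV parser, for local files and S3 objects alike.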
[jira] [Updated] (ARROW-10142) [C++] RecordBatchStreamReader should use StreamDecoder
[ https://issues.apache.org/jira/browse/ARROW-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-10142: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] RecordBatchStreamReader should use StreamDecoder > -- > > Key: ARROW-10142 > URL: https://issues.apache.org/jira/browse/ARROW-10142 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 8.0.0 > > > There's no reason to duplicate some of the stream reading logic, and re-using > StreamDecoder would ensure the behaviour of both classes matches. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13554) [C++] Remove deprecated Scanner::Scan
[ https://issues.apache.org/jira/browse/ARROW-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460940#comment-17460940 ] Antoine Pitrou commented on ARROW-13554: [~westonpace] Do you think this can be done soon? > [C++] Remove deprecated Scanner::Scan > - > > Key: ARROW-13554 > URL: https://issues.apache.org/jira/browse/ARROW-13554 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Weston Pace >Priority: Minor > Fix For: 7.0.0 > > > It was already slated for removal in 5.0.0. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-9441) [C++] Optimize RecordBatchReader::ReadAll
[ https://issues.apache.org/jira/browse/ARROW-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-9441: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Optimize RecordBatchReader::ReadAll > - > > Key: ARROW-9441 > URL: https://issues.apache.org/jira/browse/ARROW-9441 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Ji Liu >Priority: Major > Fix For: 8.0.0 > > > Based on perf reports, more time is spent manipulating C++ data structures > than reconstructing record batches from IPC messages, which strikes me as not > what we want > here is from a perf report based on the Python code > {code} > for i in range(100): > pa.ipc.open_stream('nyctaxi.arrow').read_all() > {code} > {code} > - 50.40% 0.06% python libarrow.so.100.0.0 > [.] arrow::RecordBatchReader::ReadAll >- 50.34% arrow::RecordBatchReader::ReadAll > - 25.86% arrow::Table::FromRecordBatches > - 18.41% arrow::SimpleRecordBatch::column > - 16.00% arrow::MakeArray >- 10.49% > arrow::VisitTypeInline > 7.71% arrow::PrimitiveArray::SetData > 1.87% arrow::StringArray::StringArray >1.54% __pthread_mutex_lock >0.88% __pthread_mutex_unlock >0.67% std::_Hash_bytes >0.60% arrow::ChunkedArray::ChunkedArray > - 22.30% arrow::RecordBatchReader::ReadAll > - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext > - 15.91% arrow::ipc::ReadRecordBatchInternal >- 15.15% arrow::ipc::LoadRecordBatch > - 14.45% arrow::ipc::ArrayLoader::Load > + 13.15% arrow::VisitTypeInline > + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage > 1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch > {code} > Perhaps {{ChunkedArray}} internally should be changed to contain a vector of > {{ArrayData}} instead of boxed Arrays. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-8936) [C++] Parallelize execution of arrow::compute::ScalarFunction
[ https://issues.apache.org/jira/browse/ARROW-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8936: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Parallelize execution of arrow::compute::ScalarFunction > - > > Key: ARROW-8936 > URL: https://issues.apache.org/jira/browse/ARROW-8936 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-9441) [C++] Optimize IPC stream reading
[ https://issues.apache.org/jira/browse/ARROW-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-9441: -- Summary: [C++] Optimize IPC stream reading (was: [C++] Optimize RecordBatchReader::ReadAll) > [C++] Optimize IPC stream reading > - > > Key: ARROW-9441 > URL: https://issues.apache.org/jira/browse/ARROW-9441 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Ji Liu >Priority: Major > Fix For: 8.0.0 > > > Based on perf reports, more time is spent manipulating C++ data structures > than reconstructing record batches from IPC messages, which strikes me as not > what we want > here is from a perf report based on the Python code > {code} > for i in range(100): > pa.ipc.open_stream('nyctaxi.arrow').read_all() > {code} > {code} > - 50.40% 0.06% python libarrow.so.100.0.0 > [.] arrow::RecordBatchReader::ReadAll >- 50.34% arrow::RecordBatchReader::ReadAll > - 25.86% arrow::Table::FromRecordBatches > - 18.41% arrow::SimpleRecordBatch::column > - 16.00% arrow::MakeArray >- 10.49% > arrow::VisitTypeInline > 7.71% arrow::PrimitiveArray::SetData > 1.87% arrow::StringArray::StringArray >1.54% __pthread_mutex_lock >0.88% __pthread_mutex_unlock >0.67% std::_Hash_bytes >0.60% arrow::ChunkedArray::ChunkedArray > - 22.30% arrow::RecordBatchReader::ReadAll > - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext > - 15.91% arrow::ipc::ReadRecordBatchInternal >- 15.15% arrow::ipc::LoadRecordBatch > - 14.45% arrow::ipc::ArrayLoader::Load > + 13.15% arrow::VisitTypeInline > + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage > 1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch > {code} > Perhaps {{ChunkedArray}} internally should be changed to contain a vector of > {{ArrayData}} instead of boxed Arrays. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-8936) [C++] Parallelize execution of arrow::compute::ScalarFunction
[ https://issues.apache.org/jira/browse/ARROW-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8936: -- Issue Type: Improvement (was: New Feature) > [C++] Parallelize execution of arrow::compute::ScalarFunction > - > > Key: ARROW-8936 > URL: https://issues.apache.org/jira/browse/ARROW-8936 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information
[ https://issues.apache.org/jira/browse/ARROW-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460937#comment-17460937 ] David Li commented on ARROW-15138: -- ExecPlan already does a topological sort of the nodes so you could probably infer indentation there too. (I thought there was an issue about getting rid of multiple sinks? Though I am playing with a node that has two outputs so I can send data both to a Flight stream and a file writer.) > [C++] Make ExecPlan::ToString give some additional information > -- > > Key: ARROW-15138 > URL: https://issues.apache.org/jira/browse/ARROW-15138 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Priority: Minor > Labels: compute, engine > > Currently we have this task https://issues.apache.org/jira/browse/ARROW-14233 > However, that task is very open about the scope. So this ticket will try to > address some improvements for ExecPlan::ToString without the need to have > additional sophistication nor to extend the scope of features for > ExecPlan::ToString. > This means that we won't need to research about GraphViz (or other backends) > here. We will just need to work to have these features for ExecPlan::ToString: > * Print index/Id for each node in the physical tree > * Make sure all the nodes have names and correct index/IDs > * Apply indentation for the printing process (this may be an optional > setting in the API) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information
[ https://issues.apache.org/jira/browse/ARROW-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460934#comment-17460934 ] Weston Pace commented on ARROW-15138: - I guess for indentation you could do Dijkstra's shortest path from the sink (all costs are 1, there is always only one sink I'm pretty sure) and then base the indentation on that (highest cost is least indented) but I don't know how visually appealing that would be. > [C++] Make ExecPlan::ToString give some additional information > -- > > Key: ARROW-15138 > URL: https://issues.apache.org/jira/browse/ARROW-15138 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Priority: Minor > Labels: compute, engine > > Currently we have this task https://issues.apache.org/jira/browse/ARROW-14233 > However, that task is very open about the scope. So this ticket will try to > address some improvements for ExecPlan::ToString without the need to have > additional sophistication nor to extend the scope of features for > ExecPlan::ToString. > This means that we won't need to research about GraphViz (or other backends) > here. We will just need to work to have these features for ExecPlan::ToString: > * Print index/Id for each node in the physical tree > * Make sure all the nodes have names and correct index/IDs > * Apply indentation for the printing process (this may be an optional > setting in the API) -- This message was sent by Atlassian Jira (v8.20.1#820001)
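Since every edge cost is 1, the shortest-path-from-the-sink idea reduces to a plain BFS over reversed edges. A pure-Python sketch with a hypothetical edge-list representation of the plan (highest cost, i.e. the sources farthest from the sink, gets the least indentation):

```python
from collections import deque

def indent_levels(edges, sink):
    """edges: (input_node, output_node) pairs. BFS from the sink over
    reversed edges gives each node's distance to the sink; indentation
    is then deepest-distance minus distance, so sources sit flush left."""
    inputs = {}
    for src, dst in edges:
        inputs.setdefault(dst, []).append(src)
    dist = {sink: 0}
    queue = deque([sink])
    while queue:
        node = queue.popleft()
        for upstream in inputs.get(node, []):
            if upstream not in dist:
                dist[upstream] = dist[node] + 1
                queue.append(upstream)
    deepest = max(dist.values())
    return {node: deepest - d for node, d in dist.items()}
```

For a linear scan → filter → project → sink plan this yields indentation 0 for the scan and 3 for the sink, matching the "highest cost is least indented" rule.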
[jira] [Commented] (ARROW-14444) [C++] Implement task-based model into the executable-pipelines.
[ https://issues.apache.org/jira/browse/ARROW-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460933#comment-17460933 ] Antoine Pitrou commented on ARROW-14444: [~westonpace] Is this something that needs to be done, or a suggestion? > [C++] Implement task-based model into the executable-pipelines. > --- > > Key: ARROW-14444 > URL: https://issues.apache.org/jira/browse/ARROW-14444 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 7.0.0 > > > The plan fragments will use a task-based model for parallel processing. > That means that there is a scheduler that gets assigned tasks for execution, > keeps selecting the next task for execution, and assigning it to a specific > executor (aka arrow thread pool). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14444) [C++] Implement task-based model into the executable-pipelines.
[ https://issues.apache.org/jira/browse/ARROW-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-14444: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement task-based model into the executable-pipelines. > --- > > Key: ARROW-14444 > URL: https://issues.apache.org/jira/browse/ARROW-14444 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > The plan fragments will use a task-based model for parallel processing. > That means that there is a scheduler that gets assigned tasks for execution, > keeps selecting the next task for execution, and assigning it to a specific > executor (aka arrow thread pool). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information
[ https://issues.apache.org/jira/browse/ARROW-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460932#comment-17460932 ] Weston Pace commented on ARROW-15138: - Some of this stuff we may already do in the current ToString. We have a "label" which is the "id" and, if the caller doesn't assign it, we already assign it based on a plan scoped counter (look for auto_label_counter_ in exec_plan.cc). So that handles the first bullet. I'm pretty sure most of the nodes have the correct name. For plans that aren't trees we should print the node the first time we come across it and then, on future times we will come across it, print some kind of reference to the node based on the id/label. For example, we could just throw a # in front of the label (#7). > [C++] Make ExecPlan::ToString give some additional information > -- > > Key: ARROW-15138 > URL: https://issues.apache.org/jira/browse/ARROW-15138 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Priority: Minor > Labels: compute, engine > > Currently we have this task https://issues.apache.org/jira/browse/ARROW-14233 > However, that task is very open about the scope. So this ticket will try to > address some improvements for ExecPlan::ToString without the need to have > additional sophistication nor to extend the scope of features for > ExecPlan::ToString. > This means that we won't need to research about GraphViz (or other backends) > here. We will just need to work to have these features for ExecPlan::ToString: > * Print index/Id for each node in the physical tree > * Make sure all the nodes have names and correct index/IDs > * Apply indentation for the printing process (this may be an optional > setting in the API) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13310) [C++] Implement hash_aggregate mode kernel
[ https://issues.apache.org/jira/browse/ARROW-13310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13310: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement hash_aggregate mode kernel > -- > > Key: ARROW-13310 > URL: https://issues.apache.org/jira/browse/ARROW-13310 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > Labels: kernel > Fix For: 8.0.0 > > > Requires ARROW-12759. > We have a scalar aggregate kernel for this already and hopefully the > implementation can be reused. Note, Pandas actually doesn't expose this in > DataFrameGroupBy. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-7798) [R] Refactor R <-> Array conversion
[ https://issues.apache.org/jira/browse/ARROW-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-7798: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Refactor R <-> Array conversion > --- > > Key: ARROW-7798 > URL: https://issues.apache.org/jira/browse/ARROW-7798 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Francois Saint-Jacques >Assignee: Romain Francois >Priority: Major > Fix For: 8.0.0 > > > There's a bit of technical debt accumulated in array_to_vector and > vector_to_array: > * Mix of conversion *and* casting, ideally we'd move casting out of there > (at the cost of more memory copy). The rationale is that the conversion logic > will differ from the CastKernels, e.g. when to raise errors, benefits from > complex conversions like timezone... The current implementation is fast, e.g. > it fuses the conversion and casting in a single loop at the cost of code > clarity and divergence. > * There should be 2 paths, zero-copy, non zero-copy. The non-zero copy > should use the newly introduced VectorToArrayConverter which will work with > complex nested types. > * The in array_to vector, Converter should work primarily with Array and not > ArrayVector > * The vector_to_array should not use builders, sizes are known, the null > bitmap should be constructed separately. There's probably a chance that we > can re-use R's memory with zero-copy for the raw data. > * There seem to be multiple paths that do the same conversion: > [https://github.com/apache/arrow/pull/7514#discussion_r446706140] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13339) [C++] Implement hash_aggregate kernels (umbrella issue)
[ https://issues.apache.org/jira/browse/ARROW-13339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13339: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement hash_aggregate kernels (umbrella issue) > --- > > Key: ARROW-13339 > URL: https://issues.apache.org/jira/browse/ARROW-13339 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > Labels: kernel > Fix For: 8.0.0 > > > Umbrella issue to track implementation of hash aggregate kernels. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13309) [C++] Implement hash_aggregate exact quantile kernel
[ https://issues.apache.org/jira/browse/ARROW-13309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13309: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement hash_aggregate exact quantile kernel > > > Key: ARROW-13309 > URL: https://issues.apache.org/jira/browse/ARROW-13309 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > Labels: kernel > Fix For: 8.0.0 > > > Requires ARROW-12759. > We have a scalar aggregate kernel for this already and hopefully the > implementation can be reused. > Also create a pair of kernels {{median()}} and {{hash_median()}} that wrap > this for convenient use by bindings (like in ARROW-14052). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-8470) [Python][R] Expose incremental write API for Feather files
[ https://issues.apache.org/jira/browse/ARROW-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-8470: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [Python][R] Expose incremental write API for Feather files > -- > > Key: ARROW-8470 > URL: https://issues.apache.org/jira/browse/ARROW-8470 > Project: Apache Arrow > Issue Type: Improvement > Components: Python, R >Reporter: Wes McKinney >Priority: Minor > Fix For: 8.0.0 > > > This is already available for writing IPC files, so this would mostly be an > interface to that with the addition of logic to handle conversions from > Python or R data frames and splitting the inputs based on the configured > Feather chunksize -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-9235) [R] Support for `connection` class when reading and writing files
[ https://issues.apache.org/jira/browse/ARROW-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-9235: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Support for `connection` class when reading and writing files > - > > Key: ARROW-9235 > URL: https://issues.apache.org/jira/browse/ARROW-9235 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Michael Quinn >Priority: Major > Fix For: 8.0.0 > > > We have an internal filesystem that we interact with through objects that > inherit from the connection class. These files aren't necessarily local, > making it slightly more complicated to read and write parquet files, for > example. > For now, we're generating raw vectors and using them to create the file. For > example, to read files > {noformat} > ReadParquet <- function(filename, ...) { > file <- file(filename, "rb") > on.exit(close(file)) > raw <- readBin(file, "raw", FileInfo(filename)$size) > return(arrow::read_parquet(raw, ...)) > } > {noformat} > And to write, > {noformat} > WriteParquet <- function(df, filepath, ...) { > stream <- BufferOutputStream$create() > write_parquet(df, stream, ...) > raw <- stream$finish()$data() > file <- file(filepath, "wb") > on.exit(close(file)) > writeBin(raw, file) > return(invisible()) > } > {noformat} > At the C++ level, we are interacting with `R_new_custom_connection` defined > here: > [https://github.com/wch/r-source/blob/trunk/src/include/R_ext/Connections.h] > I've been very impressed with how feature-rich arrow is. It would be nice to > see this API supported as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13530) [C++] Implement cumulative sum compute function
[ https://issues.apache.org/jira/browse/ARROW-13530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13530: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement cumulative sum compute function > --- > > Key: ARROW-13530 > URL: https://issues.apache.org/jira/browse/ARROW-13530 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Labels: good-first-issue, good-second-issue, kernel > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-9293) [R] Add chunk_size to Table$create()
[ https://issues.apache.org/jira/browse/ARROW-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-9293: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Add chunk_size to Table$create() > > > Key: ARROW-9293 > URL: https://issues.apache.org/jira/browse/ARROW-9293 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Romain Francois >Priority: Major > Fix For: 8.0.0 > > > While working on ARROW-3308, I noticed that write_feather has a chunk_size > argument, which by default will write batches of 64k rows into the file. In > principle, a chunking strategy like this would prevent the need to bump up to > large_utf8 when ingesting a large character vector because you'd end up with > many chunks that each fit into a regular utf8 type. However, the way the > function works, the data.frame is converted to a Table with all ChunkedArrays > containing a single chunk first, which is where the large_utf8 type gets set. > But if Table$create() could be instructed to make multiple chunks, this would > be resolved. -- This message was sent by Atlassian Jira (v8.20.1#820001)
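The chunking strategy the ticket describes, splitting a long vector into fixed-size pieces so that each chunk fits a regular utf8 array, can be sketched in plain Python (an illustrative stand-in, not the arrow R implementation):

```python
def chunked(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]

# A Table built from these pieces would hold one ChunkedArray with
# len(chunks) chunks instead of a single oversized array.
chunks = chunked(list(range(10)), 4)
```

With a `chunk_size` argument on `Table$create()`, each chunk would stay under the 2 GB offset limit of the regular utf8 type, avoiding the silent promotion to large_utf8.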
[jira] [Updated] (ARROW-11071) [R][CI] Use processx to set up minio and flight servers in tests
[ https://issues.apache.org/jira/browse/ARROW-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11071: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R][CI] Use processx to set up minio and flight servers in tests > > > Key: ARROW-11071 > URL: https://issues.apache.org/jira/browse/ARROW-11071 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Minor > Fix For: 8.0.0 > > > Rather than rely on them being set up outside of the tests. processx is > already a transitive test dependency (testthat uses it) so there's no reason > for us not to. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-1888) [C++] Implement casts from one struct type to another (with same field names and number of fields)
[ https://issues.apache.org/jira/browse/ARROW-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-1888: - Assignee: (was: Fernando Rodriguez) > [C++] Implement casts from one struct type to another (with same field names > and number of fields) > -- > > Key: ARROW-1888 > URL: https://issues.apache.org/jira/browse/ARROW-1888 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: analytics, good-first-issue, kernel > Fix For: 7.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-11441) [R] Read CSV from character vector
[ https://issues.apache.org/jira/browse/ARROW-11441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11441: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Read CSV from character vector > -- > > Key: ARROW-11441 > URL: https://issues.apache.org/jira/browse/ARROW-11441 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Minor > Fix For: 8.0.0 > > > `readr::read_csv()` lets you read in data from a character vector, useful for > (e.g.) taking the results of a system call and reading it in as a data.frame. > {code} > > readr::read_csv(c("a,b", "1,2", "3,4")) > # A tibble: 2 x 2 > a b > > 1 1 2 > 2 3 4 > {code} > One solution would be similar to ARROW-9235, perhaps, treating it as a > textConnection. > Another solution is to write to a tempfile. -- This message was sent by Atlassian Jira (v8.20.1#820001)
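For comparison, Python's standard library already reads CSV from an in-memory character vector by wrapping it in a text stream, the same idea as the `textConnection` approach suggested above (a sketch of the concept, not arrow code):

```python
import csv
import io

def read_csv_from_lines(lines):
    """Parse CSV rows from a list of strings; the first row is the header."""
    return list(csv.DictReader(io.StringIO("\n".join(lines))))

# Mirrors readr::read_csv(c("a,b", "1,2", "3,4"))
rows = read_csv_from_lines(["a,b", "1,2", "3,4"])
```

No tempfile is needed because the parser only requires a stream interface, which is what wrapping the character vector provides.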
[jira] [Updated] (ARROW-11699) [R] Implement dplyr::across()
[ https://issues.apache.org/jira/browse/ARROW-11699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11699: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Implement dplyr::across() > - > > Key: ARROW-11699 > URL: https://issues.apache.org/jira/browse/ARROW-11699 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > > It's not a generic, but because it seems only to be called inside of > functions like `mutate()`, we can insert our own version of it into the NSE > data mask -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-11755) [R] Add tests from dplyr/test-mutate.r
[ https://issues.apache.org/jira/browse/ARROW-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11755: --- Fix Version/s: (was: 7.0.0) > [R] Add tests from dplyr/test-mutate.r > -- > > Key: ARROW-11755 > URL: https://issues.apache.org/jira/browse/ARROW-11755 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Minor > Labels: pull-request-available > Time Spent: 7h 10m > Remaining Estimate: 0h > > Review > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r > and port tests over to arrow as needed to see if there are edge cases we > aren't covering appropriately. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-11841) [R][C++] Allow cancelling long-running commands
[ https://issues.apache.org/jira/browse/ARROW-11841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11841: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R][C++] Allow cancelling long-running commands > --- > > Key: ARROW-11841 > URL: https://issues.apache.org/jira/browse/ARROW-11841 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > When calling a long-running task (for example reading a CSV file) from the R > prompt, users may want to interrupt with Ctrl-C. > Allowing this will require integrating R's user interruption facility with > the cancellation API that's going to be exposed in C++ (see ARROW-8732). > Below some information I've gathered on the topic: > There is some hairy discussion of how to interrupt C++ code from R at > https://stackoverflow.com/questions/40563522/r-how-to-write-interruptible-c-function-and-recover-partial-results > and https://stat.ethz.ch/pipermail/r-devel/2011-April/060714.html . > It seems it may involve polling cpp11::check_user_interrupt() and catching > any cpp11::unwind_exception that may signal an interruption. A complication > is that apparently R APIs should only be called from the main thread. There's > also a small library which claims to make writing all this easier: > https://github.com/tnagler/RcppThread/blob/master/inst/include/RcppThread/RMonitor.hpp > But since user interruptions will only be noticed by the R main thread, the > solution may be to launch heavy computations (e.g. CSV reading) in a separate > thread and have the main R thread periodically poll for interrupts while > waiting for the separate thread. This is what this dedicated thread class > does in its join method: > https://github.com/tnagler/RcppThread/blob/master/inst/include/RcppThread/Thread.hpp#L79 -- This message was sent by Atlassian Jira (v8.20.1#820001)
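The pattern described above, running the heavy computation on a worker thread while the main thread periodically polls for user interrupts between short timed joins, can be sketched as follows (function names are hypothetical illustrations, not the R or C++ API):

```python
import threading
import time

def run_with_interrupt_polling(target, poll_interrupt, interval=0.01):
    """Run target on a worker thread; the calling (main) thread polls for
    user interrupts between short, timed joins."""
    worker = threading.Thread(target=target)
    worker.start()
    while worker.is_alive():
        worker.join(timeout=interval)  # wait briefly, then yield to the poller
        poll_interrupt()               # stands in for cpp11::check_user_interrupt()
    worker.join()

# Demo: a "heavy computation" that takes long enough to be polled at least once.
result = []
polls = []
run_with_interrupt_polling(
    target=lambda: (time.sleep(0.05), result.append("done")),
    poll_interrupt=lambda: polls.append(1),
)
```

This respects the constraint that R APIs may only be called from the main thread: only the poller runs there, while the computation proceeds elsewhere.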
[jira] [Updated] (ARROW-12059) [R] Accept format-specific scan options in collect()
[ https://issues.apache.org/jira/browse/ARROW-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12059: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Accept format-specific scan options in collect() > > > Key: ARROW-12059 > URL: https://issues.apache.org/jira/browse/ARROW-12059 > Project: Apache Arrow > Issue Type: Task > Components: R >Affects Versions: 4.0.0 >Reporter: David Li >Priority: Major > Labels: dataset, datasets > Fix For: 8.0.0 > > > ARROW-9749 and ARROW-8631 added format/scan-specific options. In R, the most > natural place to accept these is in collect(), but this isn't yet done. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12105) [R] Replace vars_select, vars_rename with eval_select, eval_rename
[ https://issues.apache.org/jira/browse/ARROW-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12105: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Replace vars_select, vars_rename with eval_select, eval_rename > -- > > Key: ARROW-12105 > URL: https://issues.apache.org/jira/browse/ARROW-12105 > Project: Apache Arrow > Issue Type: Task > Components: R >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Fix For: 8.0.0 > > > The tidyselect {{vars_select}} and {{vars_rename}} functions are on their way > to being deprecated. Switch to using the newer {{eval_select}} and > {{eval_rename}} functions. This is required to use the {{where()}} selection > helper (ARROW-12778). Unfortunately, this is not a trivial task. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12137) [R] New/improved vignette on dplyr features
[ https://issues.apache.org/jira/browse/ARROW-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12137: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] New/improved vignette on dplyr features > --- > > Key: ARROW-12137 > URL: https://issues.apache.org/jira/browse/ARROW-12137 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15117) [Docs] Splitting the sphinx-based Arrow docs into separate sphinx projects
[ https://issues.apache.org/jira/browse/ARROW-15117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15117: --- Labels: pull-request-available (was: ) > [Docs] Splitting the sphinx-based Arrow docs into separate sphinx projects > -- > > Key: ARROW-15117 > URL: https://issues.apache.org/jira/browse/ARROW-15117 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > See the mailing list > (https://mail-archives.apache.org/mod_mbox/arrow-dev/202112.mbox/%3CCALQtMBbiasQtXYc46kpw-TyQ-TQSPjNQ5%2BkoREuKvJ3hJSdWjw%40mail.gmail.com%3E) > and this google doc > (https://docs.google.com/document/d/1AXDNwU5CSnZ1cSeUISwy_xgvTzoYWeuqWApC8UEv97Q/edit?usp=sharing) > for more context. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12213) [R] copy_files doesn't make it easy to copy a single file
[ https://issues.apache.org/jira/browse/ARROW-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12213: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] copy_files doesn't make it easy to copy a single file > - > > Key: ARROW-12213 > URL: https://issues.apache.org/jira/browse/ARROW-12213 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, R >Reporter: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > > copy_files (i.e. fs::CopyFiles) makes it trivial to recursively copy a > directory/bucket to or from S3, but I'm having a hard time downloading a > single file. > cc [~bkietz] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12778) [R] Support tidyselect where() selection helper in dplyr verbs
[ https://issues.apache.org/jira/browse/ARROW-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12778: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Support tidyselect where() selection helper in dplyr verbs > -- > > Key: ARROW-12778 > URL: https://issues.apache.org/jira/browse/ARROW-12778 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > Fix For: 8.0.0 > > > Since we can now determine the data type of an unevaluated array expression > (ARROW-12291) I think we should be able to support the {{where()}} selection > helper. > This is already done for the {{relocate()}} verb (in ARROW-12781 ) but not > for any other verbs. > Steps required to do this: > # ARROW-12781 > # ARROW-12105 > # Remove the {{check_select_helpers()}} function definition and remove all > the calls to it > # Modify any remaining the {{expect_error()}} tests that test {{where()}} > and check for the error message {{"Unsupported selection helper"}} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12711) [R] Bindings for paste(collapse), str_c(collapse), and str_flatten()
[ https://issues.apache.org/jira/browse/ARROW-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12711: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Bindings for paste(collapse), str_c(collapse), and str_flatten() > > > Key: ARROW-12711 > URL: https://issues.apache.org/jira/browse/ARROW-12711 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Ian Cook >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > These are the aggregating versions of string concatenation: they combine > values from a set of rows into a single value. > The bindings for {{paste()}} and {{str_c()}} might be tricky to implement > because when these functions are called with the {{collapse}} argument > unset, they do _not_ aggregate. > In {{summarise()}} we need to be able to use scalar concatenation within > aggregate concatenation, like this: > {code:java} > starwars %>% > filter(!is.na(hair_color) & !is.na(eye_color)) %>% > group_by(homeworld) %>% > summarise(hair_and_eyes = paste0(paste0(hair_color, "-haired and ", > eye_color, "-eyed"), collapse = ", ")){code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
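The scalar-then-aggregate pattern in the query above (elementwise paste0 inside a grouped collapse) amounts to the following, sketched in plain Python rather than R to make the two concatenation modes explicit:

```python
def collapse_by_group(keys, values, sep=", "):
    """Aggregate string concatenation: join each group's values into one string."""
    groups = {}
    for key, value in zip(keys, values):
        groups.setdefault(key, []).append(value)
    return {key: sep.join(vals) for key, vals in groups.items()}

homeworld = ["Tatooine", "Tatooine", "Naboo"]
traits = [("blond", "blue"), ("brown", "brown"), ("none", "orange")]

# Scalar (elementwise) concatenation: the inner paste0(), one output per row.
described = [f"{hair}-haired and {eye}-eyed" for hair, eye in traits]

# Aggregate concatenation with collapse set: the outer call, one output per group.
by_world = collapse_by_group(homeworld, described)
```

The tricky part the ticket notes is that the same R function does both jobs, switching from the rowwise form to the grouped form only when `collapse` is set.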
[jira] [Updated] (ARROW-8626) [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool
[ https://issues.apache.org/jira/browse/ARROW-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8626: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool > --- > > Key: ARROW-8626 > URL: https://issues.apache.org/jira/browse/ARROW-8626 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Weston Pace >Priority: Major > Fix For: 8.0.0 > > > Currently, when submitting tasks to a thread pool, they are all commingled in > a common queue. When a new task submitter shows up, they must wait in the > back of the line behind all other queued tasks. > A simple alternative to this would be round-robin scheduling, where each new > consumer is assigned a unique integer id, and the schedule / thread pool > internally maintains the tasks associated with the consumer in separate > queues. -- This message was sent by Atlassian Jira (v8.20.1#820001)
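The round-robin idea above, per-consumer queues drained in rotation so that a new submitter is not stuck behind every previously queued task, can be sketched as follows (an illustrative model only, not the Arrow C++ ThreadPool API):

```python
from collections import deque

class RoundRobinScheduler:
    """Per-consumer task queues drained one task at a time, in rotation."""

    def __init__(self):
        self.queues = {}      # consumer id -> deque of tasks
        self.order = deque()  # rotation order of consumer ids

    def submit(self, consumer_id, task):
        if consumer_id not in self.queues:
            self.queues[consumer_id] = deque()
            self.order.append(consumer_id)
        self.queues[consumer_id].append(task)

    def next_task(self):
        # Visit each consumer at most once per call; skip empty queues.
        for _ in range(len(self.order)):
            consumer_id = self.order[0]
            self.order.rotate(-1)
            if self.queues[consumer_id]:
                return self.queues[consumer_id].popleft()
        return None

sched = RoundRobinScheduler()
for task in ["a1", "a2", "a3"]:
    sched.submit("A", task)
sched.submit("B", "b1")
drained = [sched.next_task() for _ in range(4)]
```

Even though consumer B arrived after A had three tasks queued, B's first task runs second rather than fourth.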
[jira] [Updated] (ARROW-13118) [R] Improve handling of R scalars in some nse_funcs
[ https://issues.apache.org/jira/browse/ARROW-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13118: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Improve handling of R scalars in some nse_funcs > --- > > Key: ARROW-13118 > URL: https://issues.apache.org/jira/browse/ARROW-13118 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Fix For: 8.0.0 > > > Some of the functions in {{nse_funcs}} do not behave properly when passed R > scalar input in expressions in dplyr verbs. Some examples: > {code:r} > Table$create(x = 1) %>% mutate(as.character(42)) > Table$create(x = 1) %>% mutate(is.character(("foo"))) > Table$create(x = 1) %>% mutate(nchar("foo")) > Table$create(x = 1) %>% mutate(is.infinite(Inf)) > {code} > This could be resolved by using {{build_expr()}} instead of > {{Expression$create()}}, but {{build_expr()}} is somewhat heavy. The only > part of it we really need to make this work is this: > {code:r} > args <- lapply(args, function(x) { > if (!inherits(x, "Expression")) { > x <- Expression$scalar(x) > } > x > }){code} > If {{build_expr()}} is too heavy, we could make a function called > {{wrap_r_scalar}}, like this: > {code:r} > wrap_r_scalar <- function(x) { > if (!inherits(x, "Expression")) { > assert_that( > length(x) == 1, > msg = "Literal vectors of length != 1 not supported" > ) > Expression$scalar(x) > } else { > x > } > } > {code} > and use it as needed in various {{nse_funcs}} functions. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13337) [R] Define Math group generics
[ https://issues.apache.org/jira/browse/ARROW-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13337: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Define Math group generics > -- > > Key: ARROW-13337 > URL: https://issues.apache.org/jira/browse/ARROW-13337 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > > Following ARROW-13280 we now support log, trigonometry, and other functions > that are in the {{Math}} S3 generic, so we should define it like how we've > already done for {{Ops}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13993) [C++] Hash aggregate function that returns value from first row in group
[ https://issues.apache.org/jira/browse/ARROW-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460929#comment-17460929 ] Antoine Pitrou commented on ARROW-13993: Since the result would be non-deterministic, I'm not sure I understand the point of a {{hash_take}} function compared to the {{hash_first}} proposal. > [C++] Hash aggregate function that returns value from first row in group > > > Key: ARROW-13993 > URL: https://issues.apache.org/jira/browse/ARROW-13993 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Priority: Major > Labels: good-second-issue, kernel > Fix For: 7.0.0 > > > It would be nice to have a hash aggregate function that returns the first > value of a column within each hash group. > If row order within groups is non-deterministic, then effectively this would > return one arbitrary value. This is a very computationally cheap operation. > This can be quite useful when querying a non-normalized table. For example if > you have a table with a {{country}} column and also a {{country_abbr}} column > and you want to group by either/both of those columns but return the values > from both columns, you could do > {code:java} > SELECT country, country_abbr FROM table GROUP BY country, country_abbr{code} > but it would be more efficient to do > {code:java} > SELECT country, first(country_abbr) FROM table GROUP BY country{code} > because then the engine does not need to scan all the values of the > {{country_abbr}} column. -- This message was sent by Atlassian Jira (v8.20.1#820001)
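The proposed first-value-per-group semantics reduce to a single pass over the data, which is why the ticket calls it computationally cheap. A plain-Python sketch of the kernel's behavior (illustrative only, not Arrow code):

```python
def hash_first(keys, values):
    """Return the first value encountered for each distinct key."""
    first = {}
    for key, value in zip(keys, values):
        if key not in first:
            first[key] = value
    return first

# Mirrors: SELECT country, first(country_abbr) FROM table GROUP BY country
countries = ["Peru", "Peru", "Chile"]
abbrs = ["PE", "PE", "CL"]
result = hash_first(countries, abbrs)
```

When the functional dependency holds (one abbreviation per country), the arbitrariness of "first" is harmless, which is exactly the denormalized-table use case described above.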
[jira] [Updated] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-14436: --- Priority: Minor (was: Major) > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from https://github.com/apache/arrow/issues/11279 > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. > I can open a PR if you tell me which way you prefer. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13854) [R] More accurately determine output type of an aggregation expression
[ https://issues.apache.org/jira/browse/ARROW-13854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13854: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] More accurately determine output type of an aggregation expression > -- > > Key: ARROW-13854 > URL: https://issues.apache.org/jira/browse/ARROW-13854 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Minor > Fix For: 8.0.0 > > > ARROW-13740 added some basic heuristics, but the C++ library surely knows the > actual types; we just need to find out how to ask it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13767) [R] Add Arrow methods slice(), slice_head(), slice_tail()
[ https://issues.apache.org/jira/browse/ARROW-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13767: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Add Arrow methods slice(), slice_head(), slice_tail() > - > > Key: ARROW-13767 > URL: https://issues.apache.org/jira/browse/ARROW-13767 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > Implement [{{slice()}}, {{slice_head()}}, and > {{slice_tail()}}|https://dplyr.tidyverse.org/reference/slice.html] methods > for {{ArrowTabular}}, {{Dataset}}, and {{arrow_dplyr_query}} objects. I > believe this should be relatively straightforward, using {{Take()}} to return > only the specified rows. We already have a {{head()}} method which I believe > we can reuse for {{slice_head()}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460927#comment-17460927 ] Antoine Pitrou commented on ARROW-14436: I'm a bit surprised about this. I use ccache and have no problem with color diagnostics. Is it just because of an old version of ccache? > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from https://github.com/apache/arrow/issues/11279 > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. > I can open a PR if you tell me which way you prefer. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13926) [R] Support mutate/summarize with implicit join
[ https://issues.apache.org/jira/browse/ARROW-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13926: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Support mutate/summarize with implicit join > --- > > Key: ARROW-13926 > URL: https://issues.apache.org/jira/browse/ARROW-13926 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > > {code} > mtcars %>% > group_by(cyl) %>% > mutate(x = hp - mean(hp)) > {code} > essentially means something like > {code} > mtcars %>% > left_join(mtcars %>% > group_by(cyl) %>% > summarize(tmp = mean(hp)) > ) %>% > mutate(x = hp - tmp) %>% > select(-tmp) > {code} > Apparently you can do the same inside summarize() too (though IDK if that's > behavior we want to encourage). Once we can do joins, we can support these > queries. -- This message was sent by Atlassian Jira (v8.20.1#820001)
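The equivalence sketched in the ticket, a grouped mutate being the same as joining a per-group summary back onto the rows, can be demonstrated in plain Python (an illustration of the rewrite, not arrow code):

```python
def demean_by_group(keys, values):
    """Subtract each key's group mean from its values (a grouped mutate)."""
    sums, counts = {}, {}
    for key, value in zip(keys, values):
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    # The per-group summary that the implicit left join would attach:
    means = {key: sums[key] / counts[key] for key in sums}
    # The mutate step: each row looks up its group's summary value.
    return [value - means[key] for key, value in zip(keys, values)]

cyl = [4, 4, 6]
hp = [90.0, 110.0, 200.0]
x = demean_by_group(cyl, hp)
```

The two-pass structure (summarize, then join back) is exactly what makes join support a prerequisite for these queries.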
[jira] [Commented] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information
[ https://issues.apache.org/jira/browse/ARROW-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460926#comment-17460926 ]

David Li commented on ARROW-15138:
----------------------------------
One thing that's been requested is a way to rebuild an exec plan printed in one language (e.g. so that we can get an ExecPlan from R into C++ for debugging). Doing that fully automatically is out of scope here (though: do we want some way to serialize an ExecPlan to/from an internal format? Or is that redundant with Substrait support?), but we should ensure all necessary data is included. It may also be hard in the general case, because plans are not necessarily trees.

For options like indentation, we could add an (optional) options struct.

Otherwise I agree with the suggested improvements. One more thing to consider is ensuring we generate unique names in case there are duplicate nodes (I think we already do, since we number them?).

> [C++] Make ExecPlan::ToString give some additional information
> --------------------------------------------------------------
>
>                 Key: ARROW-15138
>                 URL: https://issues.apache.org/jira/browse/ARROW-15138
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Percy Camilo Triveño Aucahuasi
>            Priority: Minor
>              Labels: compute, engine
>
> Currently we have this task: https://issues.apache.org/jira/browse/ARROW-14233
> However, that task is very open-ended in scope, so this ticket will address
> some improvements to ExecPlan::ToString without additional sophistication
> and without extending its feature scope. This means we won't need to
> research GraphViz (or other backends) here.
> We just need ExecPlan::ToString to:
> * Print an index/ID for each node in the physical tree
> * Make sure all the nodes have names and correct indexes/IDs
> * Apply indentation for the printing process (this may be an optional
>   setting in the API)
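The "unique names via numbering" idea from the comment above can be sketched with a toy plan structure. This is a hypothetical illustration, not Arrow's actual ExecPlan API: the `Node` struct and `ToString` function below are invented for the example. A depth-first walk carries a running counter, so two nodes of the same kind still print distinct `kind:index` labels, with indentation per level.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical plan node: a kind label plus input nodes (not Arrow's API).
struct Node {
  std::string kind;  // e.g. "scan", "filter", "sink"
  std::vector<std::shared_ptr<Node>> inputs;
};

// Depth-first print; the shared counter guarantees unique indexes even when
// several nodes share the same kind.
void ToStringImpl(const Node& node, int indent, int* counter,
                  std::ostringstream* out) {
  *out << std::string(indent * 2, ' ') << node.kind << ":" << (*counter)++
       << "\n";
  for (const auto& input : node.inputs) {
    ToStringImpl(*input, indent + 1, counter, out);
  }
}

std::string ToString(const Node& root) {
  std::ostringstream out;
  int counter = 0;
  ToStringImpl(root, 0, &counter, &out);
  return out.str();
}
```

A plan with two stacked "filter" nodes over a "scan" would print as `filter:0`, `  filter:1`, `    scan:2` — duplicate kinds, unique labels.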
[jira] [Updated] (ARROW-14028) [R] Cast of NaN to integer should return NA_integer_
[ https://issues.apache.org/jira/browse/ARROW-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14028:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Cast of NaN to integer should return NA_integer_
> ----------------------------------------------------
>
>                 Key: ARROW-14028
>                 URL: https://issues.apache.org/jira/browse/ARROW-14028
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Priority: Major
>             Fix For: 8.0.0
>
> Casting double {{NaN}} to integer returns a sentinel value:
> {code:r}
> call_function("cast", Scalar$create(NaN), options = list(to_type = int32(),
>   allow_float_truncate = TRUE))
> #> Scalar
> #> -2147483648
> call_function("cast", Scalar$create(NaN), options = list(to_type = int64(),
>   allow_float_truncate = TRUE))
> #> Scalar
> #> -9223372036854775808
> {code}
> It would be nice if this instead returned {{NA_integer_}}.
> N.B. for some reason this doesn't reproduce in dplyr unless you round-trip it
> back to double:
> {code:r}
> Table$create(x = NaN) %>% transmute(as.double(as.integer(x))) %>% pull(1)
> #> [1] -2147483648
> {code}
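The behavior the ticket asks for can be sketched outside Arrow's kernel machinery. The standalone function below is a hypothetical illustration (not Arrow's cast implementation): it models a nullable integer column with `std::optional`, checks each double for NaN, and emits a null rather than letting the truncating cast produce the sentinel `-2147483648`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical nullable cast: NaN maps to null (the C++ analogue of
// NA_integer_) instead of falling through to a truncating cast whose
// result for NaN is the int32 sentinel value.
std::vector<std::optional<int32_t>> CastDoubleToInt32(
    const std::vector<double>& values) {
  std::vector<std::optional<int32_t>> out;
  out.reserve(values.size());
  for (double v : values) {
    if (std::isnan(v)) {
      out.push_back(std::nullopt);  // NaN -> null, not -2147483648
    } else {
      out.push_back(static_cast<int32_t>(v));  // ordinary truncating cast
    }
  }
  return out;
}
```

In Arrow itself the null would be recorded in the array's validity bitmap rather than per-element optionals; the optional is just a compact stand-in for the sketch.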
[jira] [Updated] (ARROW-14021) [R] Bind median() and quantile() to exact not approximate median and quantile
[ https://issues.apache.org/jira/browse/ARROW-14021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14021:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Bind median() and quantile() to exact not approximate median and quantile
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-14021
>                 URL: https://issues.apache.org/jira/browse/ARROW-14021
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Priority: Major
>             Fix For: 8.0.0
>
> ARROW-13772 binds {{quantile()}} to {{tdigest()}}, which returns approximate
> quantiles, and binds {{median()}} to {{approximate_median()}}, which returns
> an approximate median. The bindings issue a warning saying that the
> median/quantile is approximate. Once ARROW-13309 is implemented, modify the
> bindings to call Arrow functions that return exact quantiles and medians,
> and remove the warnings.
> We should keep the approximate quantile and median bindings but rename them.
> When doing this, we should also modify the bindings to accept {{type}} and
> {{interpolation}} arguments like we do in the {{quantile.ArrowDatum}} method:
> https://github.com/apache/arrow/blob/170a24fc5f9f6f9c7c7d15a716c5995f03204a4b/r/R/compute.R#L156-L187
[jira] [Updated] (ARROW-13598) [C++] Deprecate Datum::COLLECTION
[ https://issues.apache.org/jira/browse/ARROW-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-13598:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Deprecate Datum::COLLECTION
> ---------------------------------
>
>                 Key: ARROW-13598
>                 URL: https://issues.apache.org/jira/browse/ARROW-13598
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> It looks like "collection" datums are not used anywhere. Where we want to
> return several pieces of data, we generally return a Struct array or scalar
> wrapping them.
> Perhaps we should simply deprecate or even remove them.
[jira] [Updated] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()
[ https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14045:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Support for .keep_all = TRUE with distinct()
> ------------------------------------------------
>
>                 Key: ARROW-14045
>                 URL: https://issues.apache.org/jira/browse/ARROW-14045
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Priority: Major
>             Fix For: 8.0.0
[jira] [Updated] (ARROW-14085) [R] Expose null placement option through sort bindings
[ https://issues.apache.org/jira/browse/ARROW-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14085:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Expose null placement option through sort bindings
> ------------------------------------------------------
>
>                 Key: ARROW-14085
>                 URL: https://issues.apache.org/jira/browse/ARROW-14085
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Assignee: Ian Cook
>            Priority: Major
>              Labels: kernel
>             Fix For: 8.0.0
>
> ARROW-12063 added a null placement option to the sort kernels and to
> {{OrderBySinkNode}} in the C++ library. Expose this through the R bindings.
[jira] [Updated] (ARROW-14071) [R] Try to arrow_eval user-defined functions
[ https://issues.apache.org/jira/browse/ARROW-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14071:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Try to arrow_eval user-defined functions
> --------------------------------------------
>
>                 Key: ARROW-14071
>                 URL: https://issues.apache.org/jira/browse/ARROW-14071
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Dewey Dunnington
>            Priority: Major
>             Fix For: 8.0.0
>
> The first test passes but the second one fails, even though they're
> equivalent. The user's function isn't being evaluated in the nse_funcs
> environment.
> {code}
> expect_dplyr_equal(
>   input %>%
>     select(-fct) %>%
>     filter(nchar(padded_strings) < 10) %>%
>     collect(),
>   tbl
> )
> isShortString <- function(x) nchar(x) < 10
> expect_dplyr_equal(
>   input %>%
>     select(-fct) %>%
>     filter(isShortString(padded_strings)) %>%
>     collect(),
>   tbl
> )
> {code}
[jira] [Updated] (ARROW-14199) [R] bindings for format where possible
[ https://issues.apache.org/jira/browse/ARROW-14199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14199:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] bindings for format where possible
> --------------------------------------
>
>                 Key: ARROW-14199
>                 URL: https://issues.apache.org/jira/browse/ARROW-14199
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>            Reporter: Jonathan Keane
>            Assignee: Dewey Dunnington
>            Priority: Major
>             Fix For: 8.0.0
>
> Now that we have {{strftime}}, we should also be able to make bindings for
> {{format()}}. This might be complicated: {{format()}} accepts a bunch of
> types that arrow doesn't (yet) support formatting, so we may need to punt
> on those, and that's ok.
> Though some of those might be wrappable with a handful of kernels stacked
> together: {{format(float)}} might be round + cast to character.
[jira] [Updated] (ARROW-14159) [R] Re-allow some multithreading on Windows
[ https://issues.apache.org/jira/browse/ARROW-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14159:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Re-allow some multithreading on Windows
> -------------------------------------------
>
>                 Key: ARROW-14159
>                 URL: https://issues.apache.org/jira/browse/ARROW-14159
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>             Fix For: 8.0.0
>
> Followup to ARROW-8379, which set use_threads = FALSE on Windows. See the
> discussion about adding more controls, disabling threading in some places
> and not others, etc. We want to do this soon after release so that we have
> a few months to see how things behave on CI before releasing again.
> ----
> Collecting some CI hangs after ARROW-8379:
> 1. Rtools35, 64-bit test suite hangs:
> https://github.com/apache/arrow/pull/11290/checks?check_run_id=3767787034
> {code}
> ** running tests for arch 'i386' ...
>   Running 'testthat.R' [17s]
>  OK
> ** running tests for arch 'x64' ...
> Error: Error: stderr is not a pipe.
> {code}
[jira] [Updated] (ARROW-14185) [C++] HashJoinNode should validate HashJoinOptions
[ https://issues.apache.org/jira/browse/ARROW-14185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-14185:
-----------------------------------
    Labels: good-first-issue query-engine  (was: query-engine)

> [C++] HashJoinNode should validate HashJoinOptions
> --------------------------------------------------
>
>                 Key: ARROW-14185
>                 URL: https://issues.apache.org/jira/browse/ARROW-14185
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: David Li
>            Priority: Major
>              Labels: good-first-issue, query-engine
>             Fix For: 7.0.0
>
> At the very least, key_cmp needs to match the number of join columns; there
> may be other invariants to uphold. Probably all the ones from the [semi-join
> PR|https://github.com/apache/arrow/pull/10845/files#diff-1028a02ecee91ae8a9fbb7318e1ce650760716d317457d48e0740c802287fbdbR38]:
> there must be the same, non-zero number of keys on both sides.
[jira] [Updated] (ARROW-9842) [C++] Explore alternative strategy for Compare kernel implementation for better performance
[ https://issues.apache.org/jira/browse/ARROW-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-9842:
----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [C++] Explore alternative strategy for Compare kernel implementation for
> better performance
> ------------------------------------------------------------------------
>
>                 Key: ARROW-9842
>                 URL: https://issues.apache.org/jira/browse/ARROW-9842
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 8.0.0
>
>         Attachments: movemask-in-chunks.diff, movemask.patch
>
> The compiler may be able to vectorize comparison operations if the
> bitpacking of results is deferred until the end (or done in chunks).
> Instead, a temporary bytemap can be populated on a chunk-by-chunk basis and
> then the bytemaps can be bitpacked into the output buffer. This may also
> reduce the code size of the compare kernels (which are actually quite large
> at the moment).
[jira] [Updated] (ARROW-14264) [R] Support inequality joins
[ https://issues.apache.org/jira/browse/ARROW-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14264:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Support inequality joins
> ----------------------------
>
>                 Key: ARROW-14264
>                 URL: https://issues.apache.org/jira/browse/ARROW-14264
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>              Labels: query-engine
>             Fix For: 8.0.0
>
> We'll need this not-yet-merged dplyr API to do it:
> https://github.com/tidyverse/dplyr/pull/5910
[jira] [Updated] (ARROW-15137) [Dev] Update archery crossbow latest-prefix to work with nightly dates
[ https://issues.apache.org/jira/browse/ARROW-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-15137:
------------------------------------------
    Fix Version/s: 7.0.0

> [Dev] Update archery crossbow latest-prefix to work with nightly dates
> ----------------------------------------------------------------------
>
>                 Key: ARROW-15137
>                 URL: https://issues.apache.org/jira/browse/ARROW-15137
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Developer Tools
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h