[jira] [Updated] (ARROW-15144) [Java] Unable to read IPC file in master
[ https://issues.apache.org/jira/browse/ARROW-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão updated ARROW-15144: - Description: I think that PR https://github.com/apache/arrow/pull/11709 may have caused a regression in reading IPC files. Attached is an arrow file that can't be read by the Java implementation, but it can be read by all other implementations. Its contents correspond exactly to the generated_primitive.json.gz used in integration tests. Background: The integration CI pipeline in Rust's arrow2 started failing after the PR mentioned above. The logs show that all but the Java implementation are able to consume the attached file (and more generally the files created by arrow2's implementation). The PR broke almost all tests, suggesting that it is not something specific to the file but a broader issue. Log: https://pipelines.actions.githubusercontent.com/RJ1isxNgLS0jQX3HKOGkLQjJSEMqOm4RfxnyKHS4o90jAsObvY/_apis/pipelines/1/runs/14655/signedlogcontent/2?urlExpires=2021-12-17T05%3A35%3A25.6055769Z&urlSigningMethod=HMACV1&urlSignature=Nx7nRNdrcUCbtvOnnXAYGDEuSEJUiDT%2BU2jNcqqp%2FEs%3D The logs also suggest that the Java implementation may be leaking memory when such an event happens. {code:java} 2021-12-16T05:38:33.1575113Z 05:38:33.055 [main] DEBUG org.apache.arrow.vector.ipc.message.ArrowRecordBatch - Buffer in RecordBatch at 5088, length: 2040 2021-12-16T05:38:33.1577399Z 05:38:33.076 [main] ERROR org.apache.arrow.memory.BaseAllocator - Memory was leaked by query. 
Memory leaked: (8928) 2021-12-16T05:38:33.1578667Z Allocator(ROOT) 0/8928/1771528/2147483647 (res/actual/peak/limit) 2021-12-16T05:38:33.1579193Z 2021-12-16T05:38:33.1579792Z Incompatible files 2021-12-16T05:38:33.1580427Z Different values in column: 2021-12-16T05:38:33.1595138Z bool_nonnullable: Bool not null at index 0: null != false 2021-12-16T05:38:33.1597137Z 05:38:33.078 [main] ERROR org.apache.arrow.tools.Integration - Incompatible files 2021-12-16T05:38:33.1598669Z java.lang.IllegalArgumentException: Different values in column: 2021-12-16T05:38:33.1599788Z bool_nonnullable: Bool not null at index 0: null != false 2021-12-16T05:38:33.1601330Zat org.apache.arrow.vector.util.Validator.compareFieldVectors(Validator.java:133) 2021-12-16T05:38:33.1603803Zat org.apache.arrow.vector.util.Validator.compareVectorSchemaRoot(Validator.java:107) 2021-12-16T05:38:33.1605836Zat org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:209) 2021-12-16T05:38:33.1607342Zat org.apache.arrow.tools.Integration.run(Integration.java:119) 2021-12-16T05:38:33.1608817Zat org.apache.arrow.tools.Integration.main(Integration.java:70) 2021-12-16T05:38:33.1610327ZSuppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (8928) 2021-12-16T05:38:33.1611471Z Allocator(ROOT) 0/8928/1771528/2147483647 (res/actual/peak/limit) 2021-12-16T05:38:33.1612372Z 2021-12-16T05:38:33.1613537Zat org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437) 2021-12-16T05:38:33.1615288Zat org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29) 2021-12-16T05:38:33.1616926Zat org.apache.arrow.tools.Integration$Command$3.$closeResource(Integration.java:228) 2021-12-16T05:38:33.1618454Zat org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:228) 2021-12-16T05:38:33.1619500Z... 
2 common frames omitted 2021-12-16T05:38:33.1619935Z 2021-12-16T05:38:33.1620598Z -- {code} I can't discard the possibility that this is an issue in arrow2 and an undefined issue in the implementation - I am raising it here because all other implementations can read the files. For reference, the offending field (second column, bool_nonnullable), contains the following values buffer: ``` validity buffer: [0, 0, 0, 0, 0, 0, 0, 0] values buffer: [0b1100, 0b1110010, 0, 0, 0, 0, 0, 0] ``` and the FieldNode has null_count = 0. I would expect this situation to yield an array without null values. was: I think that PR https://github.com/apache/arrow/pull/11709 may have caused a regression in reading IPC files. Attached is an arrow file that can't be read by the Java implementation, but it can be read by all other implementations. Its contents correspond exactly to the generated_primitive.json.gz used in integration tests. Background: The integration CI pipeline in Rust's arrow2 started failing after the PR mentioned above. The logs show that all but the Java implementation are able to consume the attached file (and more generally the files created by arrow2's implementation). The PR broke almost all tests, suggesting that it is not something specific to the file but a broader issue. Log: https://pipelines.actions.githubusercontent.com/RJ1isxNgLS0jQX3HKOGkLQjJSEMqOm4RfxnyKHS4o90jAsObvY/_apis/pipelines/1/r
[jira] [Created] (ARROW-15144) [Java] Unable to read IPC file in master
Jorge Leitão created ARROW-15144: Summary: [Java] Unable to read IPC file in master Key: ARROW-15144 URL: https://issues.apache.org/jira/browse/ARROW-15144 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Jorge Leitão Attachments: generated_primitive.arrow I think that PR https://github.com/apache/arrow/pull/11709 may have caused a regression in reading IPC files. Attached is an arrow file that can't be read by the Java implementation, but it can be read by all other implementations. Its contents correspond exactly to the generated_primitive.json.gz used in integration tests. Background: The integration CI pipeline in Rust's arrow2 started failing after the PR mentioned above. The logs show that all but the Java implementation are able to consume the attached file (and more generally the files created by arrow2's implementation). The PR broke almost all tests, suggesting that it is not something specific to the file but a broader issue. Log: https://pipelines.actions.githubusercontent.com/RJ1isxNgLS0jQX3HKOGkLQjJSEMqOm4RfxnyKHS4o90jAsObvY/_apis/pipelines/1/runs/14655/signedlogcontent/2?urlExpires=2021-12-17T05%3A35%3A25.6055769Z&urlSigningMethod=HMACV1&urlSignature=Nx7nRNdrcUCbtvOnnXAYGDEuSEJUiDT%2BU2jNcqqp%2FEs%3D The logs also suggest that the Java implementation may be leaking memory when such an event happens. 
{code:java} 2021-12-16T05:38:37.6833847Z 05:38:37.622 [main] ERROR org.apache.arrow.tools.Integration - Incompatible files 2021-12-16T05:38:37.6835533Z java.lang.IllegalArgumentException: Different values in column: 2021-12-16T05:38:37.6836731Z f11: Timestamp(SECOND, UTC) at index 0: null != -62135596800 2021-12-16T05:38:37.6838188Zat org.apache.arrow.vector.util.Validator.compareFieldVectors(Validator.java:133) 2021-12-16T05:38:37.6840563Zat org.apache.arrow.vector.util.Validator.compareVectorSchemaRoot(Validator.java:107) 2021-12-16T05:38:37.6842476Zat org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:209) 2021-12-16T05:38:37.6843841Zat org.apache.arrow.tools.Integration.run(Integration.java:119) 2021-12-16T05:38:37.6845214Zat org.apache.arrow.tools.Integration.main(Integration.java:70) 2021-12-16T05:38:37.6846597ZSuppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (894) 2021-12-16T05:38:37.6847623Z Allocator(ROOT) 0/894/442402/2147483647 (res/actual/peak/limit) 2021-12-16T05:38:37.6848029Z 2021-12-16T05:38:37.6848996Zat org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437) 2021-12-16T05:38:37.6851316Zat org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29) 2021-12-16T05:38:37.6882832Zat org.apache.arrow.tools.Integration$Command$3.$closeResource(Integration.java:228) 2021-12-16T05:38:37.6884294Zat org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:228) 2021-12-16T05:38:37.6885249Z... 2 common frames omitted {code} I can't discard the possibility that this is an issue in arrow2 and an undefined issue in the implementation - I am raising it here because all other implementations can read the files. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15134) [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType
[ https://issues.apache.org/jira/browse/ARROW-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-15134. -- Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11975 [https://github.com/apache/arrow/pull/11975] > [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType > - > > Key: ARROW-15134 > URL: https://issues.apache.org/jira/browse/ARROW-15134 > Project: Apache Arrow > Issue Type: Sub-task > Components: GLib >Reporter: Keisuke Okada >Assignee: Keisuke Okada >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15134) [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType
[ https://issues.apache.org/jira/browse/ARROW-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-15134: - Summary: [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType (was: [GLib] Add GArrow Month, Day, Nano IntervalDataType) > [GLib] Add GArrow{Month,DayTime,MonthDayNano}IntervalDataType > - > > Key: ARROW-15134 > URL: https://issues.apache.org/jira/browse/ARROW-15134 > Project: Apache Arrow > Issue Type: Sub-task > Components: GLib >Reporter: Keisuke Okada >Assignee: Keisuke Okada >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15143) Remove incorrect comment on API of Transform for StringBinaryTransformExecBase
Eduardo Ponce created ARROW-15143: - Summary: Remove incorrect comment on API of Transform for StringBinaryTransformExecBase Key: ARROW-15143 URL: https://issues.apache.org/jira/browse/ARROW-15143 Project: Apache Arrow Issue Type: Bug Reporter: Eduardo Ponce Assignee: Eduardo Ponce Fix For: 7.0.0 The [comment describing the API of Transform method for classes that inherit from StringBinaryTransformExecBase|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L626] incorrectly states there is a last parameter for returning error status. Instead the Transform method indicates an error by returning a negative value. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15140) [CI] move to v2 of checkouts for GHA
[ https://issues.apache.org/jira/browse/ARROW-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-15140. -- Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11983 [https://github.com/apache/arrow/pull/11983] > [CI] move to v2 of checkouts for GHA > > > Key: ARROW-15140 > URL: https://issues.apache.org/jira/browse/ARROW-15140 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > https://github.com/apache/arrow/pull/11974#discussion_r770942784 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-10154) [Python][Gandiva] Add Gandiva documentation to Python docs
[ https://issues.apache.org/jira/browse/ARROW-10154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-10154: -- Assignee: (was: Will Jones) > [Python][Gandiva] Add Gandiva documentation to Python docs > -- > > Key: ARROW-10154 > URL: https://issues.apache.org/jira/browse/ARROW-10154 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva, Python >Reporter: Will Jones >Priority: Minor > Labels: documentation, gandiva, pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > I am looking at using Gandiva's Python bindings in an experimental project, > and noticed there aren't any Sphinx docs. Assuming it's not too early, I > would be willing to add documentation in Sphinx. > Two starting points for documentation: > * A how-to for creating filters and projections using the builders. The > first part of this blog post gives the right idea: > [https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/] > * API reference. If possible, it would be cool to have a table listing all > of the available functions in the registry. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-7205) [C++][Gandiva] Implement regexp_matches, regexp_like functions
[ https://issues.apache.org/jira/browse/ARROW-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-7205: - Assignee: (was: Projjal Chanda) > [C++][Gandiva] Implement regexp_matches, regexp_like functions > -- > > Key: ARROW-7205 > URL: https://issues.apache.org/jira/browse/ARROW-7205 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Projjal Chanda >Priority: Minor > Labels: pull-request-available > Time Spent: 3h 10m > Remaining Estimate: 0h > > Support following signature in Gandiva: > FunctionSignature\{name =regexp_matches, return type =Bool, param types > =[Utf8, Utf8]} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables
[ https://issues.apache.org/jira/browse/ARROW-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461106#comment-17461106 ] Will Jones commented on ARROW-15135: There is no C++ (or Rust) implementation of Iceberg; the main implementation is in Java. My impression is that writers for these table formats are about 10x more complex than the readers (but I say that having implemented a reader and not a writer :)). I think it's reasonable to start by creating the reader + dataset logic within Arrow, though once we get to the writer I do have some questions as to whether there should just be a C++ implementation managed by the Iceberg project. > [C++][R][Python] Support reading from Apache Iceberg tables > --- > > Key: ARROW-15135 > URL: https://issues.apache.org/jira/browse/ARROW-15135 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Will Jones >Priority: Major > > This is an umbrella issue for supporting the [Apache Iceberg table > format|https://iceberg.apache.org/]. > Dremio has a good overview of the format here: > https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/ -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15142) Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow
Karthik created ARROW-15142: --- Summary: Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow Key: ARROW-15142 URL: https://issues.apache.org/jira/browse/ARROW-15142 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 6.0.1 Reporter: Karthik When trying to save a Pandas dataframe with a nested type (list within list, list within dict) using pyarrow engine, the following error is encountered {color:#e75c58}ArrowInvalid{color}: ('cannot mix list and non-list, non-null values', 'Conversion failed for column A with type object') Repro: {code:java} import pandas as pd x = pd.DataFrame({"A": [[24, 27, [1, 1]]]}) x.to_parquet('/tmp/a.pqt', engine="pyarrow") {code} Doing a bit of googling, it appears that this is a known Arrow shortcoming. However, this is a commonly encountered datastructure, and 'fastparquet' handles this seamlessly. Is there a proposed timeline/plan for fixing this? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461080#comment-17461080 ] F. H. commented on ARROW-15141: --- Thanks for the context [~apitrou] , we will try your suggestion :) > [C++] Fatal error condition occurred in aws_thread_launch > - > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > {code} > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core 
dumped) > {code} > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables
[ https://issues.apache.org/jira/browse/ARROW-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461078#comment-17461078 ] Will Jones commented on ARROW-15135: I agree Weston, it should just be a dataset factory. The metadata just determines the list of files. I'm not sure this is true of Iceberg, but in my experience with Delta Lake (which has a similar design) the metadata itself can become quite large. So we may want to block this work on an Avro -> Arrow reader (ARROW-1209). > [C++][R][Python] Support reading from Apache Iceberg tables > --- > > Key: ARROW-15135 > URL: https://issues.apache.org/jira/browse/ARROW-15135 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Will Jones >Priority: Major > > This is an umbrella issue for supporting the [Apache Iceberg table > format|https://iceberg.apache.org/]. > Dremio has a good overview of the format here: > https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/ -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461075#comment-17461075 ] Antoine Pitrou commented on ARROW-15141: Since you are using conda/mamba, a workaround should be to switch to an older version of aws-sdk-cpp such as {{aws-sdk-cpp=1.8.186}}. > [C++] Fatal error condition occurred in aws_thread_launch > - > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > {code} > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core 
dumped) > {code} > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461073#comment-17461073 ] Antoine Pitrou commented on ARROW-15141: Thanks for the report. This is very likely this issue: https://github.com/aws/aws-sdk-cpp/issues/1809 > [C++] Fatal error condition occurred in aws_thread_launch > - > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > {code} > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core 
dumped) > {code} > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15141: --- Description: Hi, I am getting randomly the following error when first running inference with a Tensorflow model and then writing the result to a `.parquet` file: {code} Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS Exiting Application Stack trace: /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x7ffb14235f19] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x7ffb14227098] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x7ffb1406ea43] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x7ffb1406c35a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x7ffb142a2f5a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x7ffb147fd570] /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] 
/home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) [0x562576609a51] /bin/bash: line 1: 2341494 Aborted (core dumped) {code} My colleague ran into the same issue on Centos 8 while running the same job + same environment on SLURM, so I guess it could be some issue with tensorflow + pyarrow. Also I found a github issue with multiple people running into the same issue: [https://github.com/huggingface/datasets/issues/3310] It would be very important to my lab that this bug gets resolved, as we cannot work with parquet any more. Unfortunately, we do not have the knowledge to fix it. was: Hi, I am getting randomly the following error when first running inference with a Tensorflow model and then writing the result to a `.parquet` file: ``` Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS Exiting Application Stack trace: /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x7ffb14235f19] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x7ffb14227098] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x7ffb1406ea43] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x7ffb1406c35a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] 
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x7ffb142a2f5a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x7ffb147fd570] /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) [0x562576609a51] /bin/bash: line 1: 2341494 Aborted (core dumped) ``` My colleague ran into the same issue on Centos 8 while running the same job + same environment on SLURM, so I guess it could be some issue with tensorflow + pyarrow. Also I found a github issue with multiple people running into the same issue: [https://github.com/huggingface/datasets/issues/3310] It would be very important to my lab that this bug gets resolved, as we cannot work with parquet any more. Unfortunately, we do not have the knowledge to fix it.
[jira] [Updated] (ARROW-15141) [C++] Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15141: --- Summary: [C++] Fatal error condition occurred in aws_thread_launch (was: Fatal error condition occurred in aws_thread_launch) > [C++] Fatal error condition occurred in aws_thread_launch > - > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > ``` > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core dumped) > 
``` > > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15141) Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15141: --- Language: (was: Python) > Fatal error condition occurred in aws_thread_launch > --- > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > ``` > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core dumped) > 
``` > > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15141) Fatal error condition occurred in aws_thread_launch
[ https://issues.apache.org/jira/browse/ARROW-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15141: --- Component/s: C++ > Fatal error condition occurred in aws_thread_launch > --- > > Key: ARROW-15141 > URL: https://issues.apache.org/jira/browse/ARROW-15141 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.0, 6.0.1 > Environment: - `uname -a`: > Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` > pyarrow 6.0.0 py39hff6fa39_1_cpuconda-forge > python3.9.7 hb7a2778_3_cpythonconda-forge > python-dateutil 2.8.2 pyhd8ed1ab_0conda-forge > python-flatbuffers1.12 pyhd8ed1ab_1conda-forge > python-irodsclient1.0.0 pyhd8ed1ab_0conda-forge > python-rocksdb0.7.0py39h7fcd5f3_4conda-forge > python_abi3.9 2_cp39conda-forge > tensorflow2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-base 2.6.2 cuda112py39h7de589b_0conda-forge > tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0conda-forge > tensorflow-gpu2.6.2 cuda112py39h0bbbad9_0conda-forge >Reporter: F. H. 
>Priority: Major > > Hi, I am getting randomly the following error when first running inference > with a Tensorflow model and then writing the result to a `.parquet` file: > ``` > Fatal error condition occurred in > /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: > aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, > el_group, &thread_options) == AWS_OP_SUCCESS > Exiting Application > > Stack trace: > > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) > [0x7ffb14235f19] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) > [0x7ffb14227098] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) > [0x7ffb1406ea43] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) > [0x7ffb1406c35a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) > [0x7ffb14237fad] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) > [0x7ffb142a2f5a] > /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) > [0x7ffb147fd570] > /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] > /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] > /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) > [0x562576609a51] > /bin/bash: line 1: 2341494 Aborted (core dumped) > 
``` > > My colleague ran into the same issue on Centos 8 while running the same job + > same environment on SLURM, so I guess it could be some issue with tensorflow > + pyarrow. > Also I found a github issue with multiple people running into the same issue: > [https://github.com/huggingface/datasets/issues/3310] > > It would be very important to my lab that this bug gets resolved, as we > cannot work with parquet any more. Unfortunately, we do not have the > knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15141) Fatal error condition occurred in aws_thread_launch
F. H. created ARROW-15141: - Summary: Fatal error condition occurred in aws_thread_launch Key: ARROW-15141 URL: https://issues.apache.org/jira/browse/ARROW-15141 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 6.0.1, 6.0.0 Environment: - `uname -a`: Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux - `mamba list | grep -i "pyarrow\|tensorflow\|^python"` pyarrow 6.0.0 py39hff6fa39_1_cpu conda-forge python 3.9.7 hb7a2778_3_cpython conda-forge python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge python-flatbuffers 1.12 pyhd8ed1ab_1 conda-forge python-irodsclient 1.0.0 pyhd8ed1ab_0 conda-forge python-rocksdb 0.7.0 py39h7fcd5f3_4 conda-forge python_abi 3.9 2_cp39 conda-forge tensorflow 2.6.2 cuda112py39h9333c2f_0 conda-forge tensorflow-base 2.6.2 cuda112py39h7de589b_0 conda-forge tensorflow-estimator 2.6.2 cuda112py39h9333c2f_0 conda-forge tensorflow-gpu 2.6.2 cuda112py39h0bbbad9_0 conda-forge Reporter: F. H. Hi, I am getting randomly the following error when first running inference with a Tensorflow model and then writing the result to a `.parquet` file: ``` Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS Exiting Application Stack trace: /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x7ffb14235f19] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x7ffb14227098] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x7ffb1406ea43] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] 
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x7ffb1406c35a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x7ffb142a2f5a] /home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x7ffb147fd570] /lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27] /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba] /home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) [0x562576609a51] /bin/bash: line 1: 2341494 Aborted (core dumped) ``` My colleague ran into the same issue on Centos 8 while running the same job + same environment on SLURM, so I guess it could be some issue with tensorflow + pyarrow. Also I found a github issue with multiple people running into the same issue: [https://github.com/huggingface/datasets/issues/3310] It would be very important to my lab that this bug gets resolved, as we cannot work with parquet any more. Unfortunately, we do not have the knowledge to fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15140) [CI] move to v2 of checkouts for GHA
[ https://issues.apache.org/jira/browse/ARROW-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15140: --- Labels: pull-request-available (was: ) > [CI] move to v2 of checkouts for GHA > > > Key: ARROW-15140 > URL: https://issues.apache.org/jira/browse/ARROW-15140 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/apache/arrow/pull/11974#discussion_r770942784 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15140) [CI] move to v2 of checkouts for GHA
Jonathan Keane created ARROW-15140: -- Summary: [CI] move to v2 of checkouts for GHA Key: ARROW-15140 URL: https://issues.apache.org/jira/browse/ARROW-15140 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Jonathan Keane Assignee: Jonathan Keane https://github.com/apache/arrow/pull/11974#discussion_r770942784 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce closed ARROW-14436. - Resolution: Cannot Reproduce > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/arrow/issues/11279] > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461036#comment-17461036 ] Eduardo Ponce edited comment on ARROW-14436 at 12/16/21, 8:32 PM: -- Not sure what was causing the issue here. I was not able to reproduce such error, but based on the SO discussion above (btw not related to Apache Arrow), others have stumbled upon similar issues. I am closing this issue as [the reporter was able to prevent error by updating software|https://github.com/apache/arrow/pull/11518#issuecomment-949968012]. was (Author: edponce): Not sure what was causing the issue here. I was not able to reproduce such error, but based on the SO discussion above (BTW not related to Apache Arrow), others have stumbled upon similar issues. I am closing this issue as [the reporter was able to prevent error by updating software|https://github.com/apache/arrow/pull/11518#issuecomment-949968012]. > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/arrow/issues/11279] > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. 
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461036#comment-17461036 ] Eduardo Ponce commented on ARROW-14436: --- Not sure what was causing the issue here. I was not able to reproduce such error, but based on the SO discussion above (BTW not related to Apache Arrow), others have stumbled upon similar issues. I am closing this issue as [the reporter was able to prevent error by updating software|https://github.com/apache/arrow/pull/11518#issuecomment-949968012]. > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/arrow/issues/11279] > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Ponce updated ARROW-14436: -- Description: Copied from [https://github.com/apache/arrow/issues/11279] Steps to reproduce: Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 CCACHE_LOGFILE=./ccache.log Find in ./ccache.log: Failed; falling back to running the real compiler Result: unsupported compiler option Dropping -fcolor-diagnostics fixes the issue. I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also relevant for: -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. was: Copied from https://github.com/apache/arrow/issues/11279 Steps to reproduce: Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 CCACHE_LOGFILE=./ccache.log Find in ./ccache.log: Failed; falling back to running the real compiler Result: unsupported compiler option Dropping -fcolor-diagnostics fixes the issue. I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also relevant for: -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. I can open a PR if you tell me which way you prefer. 
> [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/arrow/issues/11279] > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14853) [C++][Python] Cryptic error message when required compute options missing
[ https://issues.apache.org/jira/browse/ARROW-14853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-14853. Resolution: Fixed Issue resolved by pull request 11961 [https://github.com/apache/arrow/pull/11961] > [C++][Python] Cryptic error message when required compute options missing > - > > Key: ARROW-14853 > URL: https://issues.apache.org/jira/browse/ARROW-14853 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: Kernels, pull-request-available > Fix For: 7.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This will confuse users: > {code:python} > >>> pc.partition_nth_indices([1,2,3]) > Traceback (most recent call last): > File "", line 1, in > pc.partition_nth_indices([1,2,3]) > File "/home/antoine/arrow/dev/python/pyarrow/compute.py", line 202, in > wrapper > return func.call(args, options, memory_pool) > File "pyarrow/_compute.pyx", line 327, in pyarrow._compute.Function.call > result = GetResultValue( > File "pyarrow/error.pxi", line 143, in > pyarrow.lib.pyarrow_internal_check_status > return check_status(status) > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > raise ArrowInvalid(message) > ArrowInvalid: Attempted to initialize KernelState from null FunctionOptions > /home/antoine/arrow/dev/cpp/src/arrow/compute/function.cc:222 > kernel->init(&kernel_ctx, {kernel, inputs, options}) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
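[Editor's illustration] The ARROW-14853 traceback above fails deep inside kernel initialization ("Attempted to initialize KernelState from null FunctionOptions") instead of telling the user which options object is missing. The sketch below is hypothetical plain Python (not Arrow's actual C++/Cython implementation, and `call_function`/`PartitionNthOptions` here are stand-ins): it shows the general pattern of validating required options up front and naming the missing options class in the error.

```python
# Hypothetical sketch of the "required options" pattern; names are stand-ins,
# not pyarrow internals.

class PartitionNthOptions:
    """Stand-in for an options class a kernel requires (e.g. a pivot index)."""
    def __init__(self, pivot):
        self.pivot = pivot

def call_function(name, args, options=None, *, required_options=None):
    # Validate before kernel init so the error names the missing options
    # class, rather than failing later with a cryptic null-state message.
    if required_options is not None and options is None:
        raise ValueError(
            f"Function '{name}' cannot be called without options. "
            f"Please provide an instance of {required_options.__name__}."
        )
    # Stand-in for actually dispatching to a kernel:
    return f"called {name} with pivot={options.pivot}"

try:
    call_function("partition_nth_indices", [1, 2, 3],
                  required_options=PartitionNthOptions)
except ValueError as e:
    print(e)  # names PartitionNthOptions instead of a KernelState failure

print(call_function("partition_nth_indices", [1, 2, 3],
                    PartitionNthOptions(pivot=1),
                    required_options=PartitionNthOptions))
```

The point of the design is that the validation happens at the function-call boundary, where the user-facing name of the options class is still known.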
[jira] [Commented] (ARROW-11441) [R] Read CSV from character vector
[ https://issues.apache.org/jira/browse/ARROW-11441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461011#comment-17461011 ] Neal Richardson commented on ARROW-11441: - Weston's suggestion works: {code} > x <- c("a,b", "1,2", "3,4") > b <- buffer(charToRaw(paste(x, collapse = "\n"))) > read_csv_arrow(b) a b 1 1 2 2 3 4 {code} read_csv_arrow already can read an arrow::Buffer, we just have to put the character vector into a Buffer. There's surely a more efficient way to do that, but this would work. > [R] Read CSV from character vector > -- > > Key: ARROW-11441 > URL: https://issues.apache.org/jira/browse/ARROW-11441 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Minor > Fix For: 8.0.0 > > > `readr::read_csv()` lets you read in data from a character vector, useful for > (e.g.) taking the results of a system call and reading it in as a data.frame. > {code} > > readr::read_csv(c("a,b", "1,2", "3,4")) > # A tibble: 2 x 2 > a b > > 1 1 2 > 2 3 4 > {code} > One solution would be similar to ARROW-9235, perhaps, treating it as a > textConnection. > Another solution is to write to a tempfile. -- This message was sent by Atlassian Jira (v8.20.1#820001)
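[Editor's illustration] The workaround in the comment above amounts to joining the character vector into one string, wrapping it in an in-memory buffer, and handing that to a reader that expects a file. The same idea sketched in Python using only the standard library (the pyarrow analog would presumably pair a buffer with `pyarrow.csv.read_csv`, which is not shown here to keep the sketch dependency-free):

```python
import csv
import io

# Same pattern as the R workaround: join the "character vector" of lines
# into one string, wrap it in an in-memory file-like buffer, and pass it
# to a CSV reader that expects a file object.
lines = ["a,b", "1,2", "3,4"]
buf = io.StringIO("\n".join(lines))

rows = list(csv.DictReader(buf))
print(rows)  # [{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}]
```

As the comment notes, a tempfile would also work, but the buffer approach avoids touching the filesystem.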
[jira] [Assigned] (ARROW-15133) [CI] Remove util_checkout.sh and util_cleanup.sh scripts
[ https://issues.apache.org/jira/browse/ARROW-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15133: -- Assignee: Jonathan Keane > [CI] Remove util_checkout.sh and util_cleanup.sh scripts > > > Key: ARROW-15133 > URL: https://issues.apache.org/jira/browse/ARROW-15133 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > - ci/scripts/util_checkout.sh was used to checkout submodules because > actions/checkout@v2 has removed support for that, but they have restored it > since. > - ci/scripts/util_cleanup.sh was used to free up disk space on github actions > runners, because at that time it was limited to 7GB, from a recent run it > looks like the linux runners now have 32GB free space so we can try to > disable the cleanup step sparing almost a minute of build time -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15133) [CI] Remove util_checkout.sh and util_cleanup.sh scripts
[ https://issues.apache.org/jira/browse/ARROW-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15133: -- Assignee: Krisztian Szucs (was: Jonathan Keane) > [CI] Remove util_checkout.sh and util_cleanup.sh scripts > > > Key: ARROW-15133 > URL: https://issues.apache.org/jira/browse/ARROW-15133 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > - ci/scripts/util_checkout.sh was used to checkout submodules because > actions/checkout@v2 has removed support for that, but they have restored it > since. > - ci/scripts/util_cleanup.sh was used to free up disk space on github actions > runners, because at that time it was limited to 7GB, from a recent run it > looks like the linux runners now have 32GB free space so we can try to > disable the cleanup step sparing almost a minute of build time -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15133) [CI] Remove util_checkout.sh and util_cleanup.sh scripts
[ https://issues.apache.org/jira/browse/ARROW-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15133. Resolution: Fixed Issue resolved by pull request 11974 [https://github.com/apache/arrow/pull/11974] > [CI] Remove util_checkout.sh and util_cleanup.sh scripts > > > Key: ARROW-15133 > URL: https://issues.apache.org/jira/browse/ARROW-15133 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > - ci/scripts/util_checkout.sh was used to checkout submodules because > actions/checkout@v2 has removed support for that, but they have restored it > since. > - ci/scripts/util_cleanup.sh was used to free up disk space on github actions > runners, because at that time it was limited to 7GB, from a recent run it > looks like the linux runners now have 32GB free space so we can try to > disable the cleanup step sparing almost a minute of build time -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13554) [C++] Remove deprecated Scanner::Scan
[ https://issues.apache.org/jira/browse/ARROW-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460997#comment-17460997 ] Weston Pace commented on ARROW-13554: - Yes, I think this and ARROW-13338 will be really good to get into 7.0.0. > [C++] Remove deprecated Scanner::Scan > - > > Key: ARROW-13554 > URL: https://issues.apache.org/jira/browse/ARROW-13554 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Weston Pace >Priority: Minor > Fix For: 7.0.0 > > > It was already slated for removal in 5.0.0. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-14444) [C++] Implement task-based model into the executable-pipelines.
[ https://issues.apache.org/jira/browse/ARROW-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace closed ARROW-14444. --- Resolution: Duplicate > [C++] Implement task-based model into the executable-pipelines. > --- > > Key: ARROW-14444 > URL: https://issues.apache.org/jira/browse/ARROW-14444 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > The plan fragments will use a task-based model for parallel processing. > That means that there is a scheduler that gets assigned tasks for execution, > keeps selecting the next task for execution, and assigning it to a specific > executor (aka arrow thread pool). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14444) [C++] Implement task-based model into the executable-pipelines.
[ https://issues.apache.org/jira/browse/ARROW-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460992#comment-17460992 ] Weston Pace commented on ARROW-14444: - I think this is superseded by ARROW-15079 which has more description > [C++] Implement task-based model into the executable-pipelines. > --- > > Key: ARROW-14444 > URL: https://issues.apache.org/jira/browse/ARROW-14444 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > The plan fragments will use a task-based model for parallel processing. > That means that there is a scheduler that gets assigned tasks for execution, > keeps selecting the next task for execution, and assigning it to a specific > executor (aka arrow thread pool). -- This message was sent by Atlassian Jira (v8.20.1#820001)
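The task-based model the ticket describes — a scheduler that keeps selecting the next task and hands it to an executor pool — can be sketched in a few lines of Python. This is a toy illustration only, not Arrow's actual C++ implementation; all names here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def run_tasks(tasks, num_workers=4):
    """Toy scheduler: drain a queue of tasks, assigning each to a worker
    in the pool (the pool plays the role of the 'arrow thread pool')."""
    pending = Queue()
    for task in tasks:
        pending.put(task)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = []
        while not pending.empty():
            task = pending.get()               # scheduler selects the next task
            futures.append(pool.submit(task))  # assign it to an executor
        return [f.result() for f in futures]
```

Results come back in submission order because the futures list mirrors the task order, even though the tasks themselves may run on any worker.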
[jira] [Assigned] (ARROW-14289) [C++] Change Scanner::Head to return a RecordBatchReader
[ https://issues.apache.org/jira/browse/ARROW-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace reassigned ARROW-14289: --- Assignee: Weston Pace > [C++] Change Scanner::Head to return a RecordBatchReader > > > Key: ARROW-14289 > URL: https://issues.apache.org/jira/browse/ARROW-14289 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Neal Richardson >Assignee: Weston Pace >Priority: Major > Fix For: 7.0.0 > > > Following ARROW-9731 and ARROW-13893. This would make it more natural to work > with ExecPlans that return a RecordBatchReader when you Run them. > Alternatively, we could move the business to RecordBatchReader::Head. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14705) [C++] unify_schemas can't handle int64 + double, affects CSV dataset
[ https://issues.apache.org/jira/browse/ARROW-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-14705: Assignee: David Li > [C++] unify_schemas can't handle int64 + double, affects CSV dataset > > > Key: ARROW-14705 > URL: https://issues.apache.org/jira/browse/ARROW-14705 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Reporter: Neal Richardson >Assignee: David Li >Priority: Major > Labels: query-engine > > Twitter question of "how can I make arrow's csv reader not make int64 for > integers", turns out to be originating from the scenario where some csvs in a > directory may have all integer values for a column but there are decimals in > others, and you can't use them together in a dataset. > {code:r} > library(arrow, warn.conflicts = FALSE) > library(dplyr, warn.conflicts = FALSE) > ds_dir <- tempfile() > dir.create(ds_dir) > cat("a\n1", file = file.path(ds_dir, "1.csv")) > cat("a\n1.1", file = file.path(ds_dir, "2.csv")) > ds <- open_dataset(ds_dir, format = "csv") > ds > #> FileSystemDataset with 2 csv files > #> a: int64 > ## It just picked the schema of the first file > collect(ds) > #> Error: Invalid: Could not open CSV input source > '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx0gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': > Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid > value '1.1' > #> ../src/arrow/csv/converter.cc:492 decoder_.Decode(data, size, quoted, > &value) > #> ../src/arrow/csv/parser.h:123 status > #> ../src/arrow/csv/converter.cc:496 parser.VisitColumn(col_index, visit) > #> ../src/arrow/csv/reader.cc:462 > internal::UnwrapOrRaise(maybe_decoded_arrays) > #> ../src/arrow/compute/exec/exec_plan.cc:398 iterator_.Next() > #> ../src/arrow/record_batch.cc:318 ReadNext(&batch) > #> ../src/arrow/record_batch.cc:329 ReadAll(&batches) > ## Let's try again and tell it to unify schemas. 
Should result in a float64 > type > ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE) > #> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs > double > #> ../src/arrow/type.cc:1621 fields_[i]->MergeWith(field) > #> ../src/arrow/type.cc:1684 AddField(field) > #> ../src/arrow/type.cc:1755 builder.AddSchema(schema) > #> ../src/arrow/dataset/discovery.cc:251 Inspect(options.inspect_options) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
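The merge failure above boils down to a missing numeric promotion rule: unifying int64 with double should widen to float64 rather than error. A minimal sketch of the desired promotion logic (pure Python with hypothetical names — not the actual C++ `Field::MergeWith` implementation):

```python
# Toy widening order: merging any two of these yields the wider type.
_WIDENING = ["int32", "int64", "float64"]

def merge_types(a, b):
    """Merge two column types, promoting to the wider numeric type."""
    if a == b:
        return a
    if a in _WIDENING and b in _WIDENING:
        return _WIDENING[max(_WIDENING.index(a), _WIDENING.index(b))]
    raise ValueError(f"incompatible types: {a} vs {b}")

def unify_schemas(schemas):
    """Merge a list of {column: type} dicts field by field."""
    merged = {}
    for schema in schemas:
        for col, typ in schema.items():
            merged[col] = merge_types(merged[col], typ) if col in merged else typ
    return merged
```

With such a rule, the two CSVs in the reproduction (`a: int64` and `a: double`) would unify to a float64 column instead of raising.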
[jira] [Commented] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables
[ https://issues.apache.org/jira/browse/ARROW-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460970#comment-17460970 ] Weston Pace commented on ARROW-15135: - So off the top of my head I think this integration might take the form of a dataset factory: The dataset factory would, given an iceberg table, consult the iceberg metadata. From that metadata we can get: - The list of files - The format of the files (Parquet vs Orc) - The partitioning scheme - Potentially the filesystem? We could then take those three things and create an ordinary FileSystemDataset. Alternatively, we could create an IcebergDataset and IcebergFragment but I'm not sure there would be anything to gain by doing so. > [C++][R][Python] Support reading from Apache Iceberg tables > --- > > Key: ARROW-15135 > URL: https://issues.apache.org/jira/browse/ARROW-15135 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Will Jones >Priority: Major > > This is an umbrella issue for supporting the [Apache Iceberg table > format|https://iceberg.apache.org/]. > Dremio has a good overview of the format here: > https://www.dremio.com/apache-iceberg-an-architectural-look-under-the-covers/ -- This message was sent by Atlassian Jira (v8.20.1#820001)
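The factory shape the comment proposes — consult the Iceberg metadata, extract the file list, format, and partitioning, then hand those to an ordinary FileSystemDataset — could look roughly like this. It is a pure-Python pseudostructure over plain dicts; every field name below is a hypothetical stand-in for whatever the real Iceberg metadata exposes:

```python
def dataset_from_iceberg(table_metadata):
    """Hypothetical factory: reduce Iceberg table metadata to the pieces
    an ordinary FileSystemDataset needs (files, format, partitioning)."""
    files = [entry["path"] for entry in table_metadata["manifest"]]
    file_format = table_metadata["format"]  # e.g. "parquet" or "orc"
    partitioning = table_metadata.get("partition_spec", [])
    return {
        "files": files,
        "format": file_format,
        "partitioning": partitioning,
    }
```

Nothing Iceberg-specific survives past this point, which is the argument against a dedicated IcebergDataset/IcebergFragment: once the metadata is resolved, the result is just a file-system dataset.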
[jira] [Resolved] (ARROW-10209) [Python] support positional arguments for options in compute wrapper
[ https://issues.apache.org/jira/browse/ARROW-10209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-10209. Resolution: Fixed Issue resolved by pull request 11955 [https://github.com/apache/arrow/pull/11955] > [Python] support positional arguments for options in compute wrapper > > > Key: ARROW-10209 > URL: https://issues.apache.org/jira/browse/ARROW-10209 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Maarten Breddels >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > As mentioned here: > [https://github.com/apache/arrow/pull/8271#discussion_r500897047] > we cannot support > {code:java} > pc.split_pattern(arr, "---") > {code} > where the second argument is a positional argument of the FunctionObject > class. > I think it makes sense for a small subset (like this function) to support > non-keyword arguments. -- This message was sent by Atlassian Jira (v8.20.1#820001)
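The dispatch that makes `pc.split_pattern(arr, "---")` work can be sketched as: accept either an options object or the option's value positionally, constructing the options object when given a bare value. This is an illustration of the idea only, not pyarrow's actual wrapper code, and the split implementation is a plain-Python stand-in:

```python
class SplitPatternOptions:
    """Toy options class standing in for a compute FunctionOptions."""
    def __init__(self, pattern, max_splits=None):
        self.pattern = pattern
        self.max_splits = max_splits

def split_pattern(strings, pattern_or_options, **kwargs):
    """Accept an options object, or build one from a positional value."""
    if isinstance(pattern_or_options, SplitPatternOptions):
        options = pattern_or_options
    else:
        options = SplitPatternOptions(pattern_or_options, **kwargs)
    return [s.split(options.pattern) for s in strings]
```

Both call styles then behave identically: `split_pattern(arr, "---")` and `split_pattern(arr, SplitPatternOptions("---"))`.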
[jira] [Resolved] (ARROW-15044) [C++] Add simple stdout/JSON exporter for OpenTelemetry
[ https://issues.apache.org/jira/browse/ARROW-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-15044. Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 11925 [https://github.com/apache/arrow/pull/11925] > [C++] Add simple stdout/JSON exporter for OpenTelemetry > --- > > Key: ARROW-15044 > URL: https://issues.apache.org/jira/browse/ARROW-15044 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available, query-engine > Fix For: 7.0.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > While it is odd for a library to configure OpenTelemetry exporters, there is > a reason for this: the actual end-user is unable to configure it directly (as > we use the C++ implementation, and the user is likely in Python or R), and > often the "important" bits are all in C++, so it would be nice to have a way > to dump the information collected. > On top of that, it would be nice to dump the data in an easily parseable > format for further consumption. > We should enable the ostream exporter, as well as a custom exporter that > dumps JSON. See > [https://github.com/open-telemetry/opentelemetry-cpp/pull/] for an > example. See discussion on [https://github.com/apache/arrow/pull/11906] as > well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-12526) [Python] Pre-generate pyarrow.compute members
[ https://issues.apache.org/jira/browse/ARROW-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460958#comment-17460958 ] Antoine Pitrou commented on ARROW-12526: [~alippai] Can you show an example invocation of pylint or another such utility? > [Python] Pre-generate pyarrow.compute members > -- > > Key: ARROW-12526 > URL: https://issues.apache.org/jira/browse/ARROW-12526 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 4.0.0 >Reporter: Adam Lippai >Priority: Minor > Fix For: 7.0.0 > > > Static analysis tools (e.g. pylint) don't recognize simple members like > pyarrow.compute.equal, they report it as _missing_. Generating the file (well, a file imported by this file, I assume) > [https://github.com/apache/arrow/blob/master/python/pyarrow/compute.py] > instead of runtime wrapping of the functions would improve the developer > experience. -- This message was sent by Atlassian Jira (v8.20.1#820001)
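The pre-generation idea — emit a static module from the runtime function registry so that static analyzers see real `def` statements instead of dynamically attached members — might look like the sketch below. The `_call` dispatcher and the way function names are obtained are assumptions; in reality the names would come from pyarrow's compute function registry:

```python
def generate_stub(function_names):
    """Emit source for a static module exposing one def per registered
    compute function, each delegating to a hypothetical _call dispatcher."""
    lines = ["# Auto-generated; do not edit.\n"]
    for name in sorted(function_names):
        lines.append(f"def {name}(*args, **kwargs):\n")
        lines.append(f"    return _call('{name}', *args, **kwargs)\n")
    return "".join(lines)
```

Because the generated file contains ordinary function definitions, tools like pylint no longer flag `pyarrow.compute.equal` as a missing member.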
[jira] [Updated] (ARROW-8998) [Python] Make NumPy an optional runtime dependency
[ https://issues.apache.org/jira/browse/ARROW-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8998: -- Issue Type: Improvement (was: New Feature) > [Python] Make NumPy an optional runtime dependency > -- > > Key: ARROW-8998 > URL: https://issues.apache.org/jira/browse/ARROW-8998 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > > Since in the relatively near future, one will be able to do non-trivial > analytical operations and query processing natively on Arrow data structures > through pyarrow, it does not make sense to require users to always install > NumPy when they install pyarrow. I propose to split the NumPy-depending parts > of libarrow_python into a libarrow_numpy (which also must be bundled) and > moving this part of the codebase into a separate Cython module. > This refactoring should be relatively painless though there may be a number > of packaging details to chase up since this would introduce a new shared > library to be installed in various packaging targets. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-8998) [Python] Make NumPy an optional runtime dependency
[ https://issues.apache.org/jira/browse/ARROW-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8998: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [Python] Make NumPy an optional runtime dependency > -- > > Key: ARROW-8998 > URL: https://issues.apache.org/jira/browse/ARROW-8998 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > > Since in the relatively near future, one will be able to do non-trivial > analytical operations and query processing natively on Arrow data structures > through pyarrow, it does not make sense to require users to always install > NumPy when they install pyarrow. I propose to split the NumPy-depending parts > of libarrow_python into a libarrow_numpy (which also must be bundled) and > moving this part of the codebase into a separate Cython module. > This refactoring should be relatively painless though there may be a number > of packaging details to chase up since this would introduce a new shared > library to be installed in various packaging targets. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-7174) [Python] Expose parquet dictionary size write parameter in python.
[ https://issues.apache.org/jira/browse/ARROW-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460957#comment-17460957 ] Antoine Pitrou commented on ARROW-7174: --- [~jorisvandenbossche] > [Python] Expose parquet dictionary size write parameter in python. > -- > > Key: ARROW-7174 > URL: https://issues.apache.org/jira/browse/ARROW-7174 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Micah Kornfield >Priority: Major > Labels: good-second-issue > Fix For: 7.0.0 > > > In some cases it might be useful to have dictionaries larger then the current > default 1MB. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-13598) [C++] Deprecate Datum::COLLECTION
[ https://issues.apache.org/jira/browse/ARROW-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-13598. -- Resolution: Fixed Issue resolved by pull request 11979 [https://github.com/apache/arrow/pull/11979] > [C++] Deprecate Datum::COLLECTION > - > > Key: ARROW-13598 > URL: https://issues.apache.org/jira/browse/ARROW-13598 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > It looks like "collection" datums are not used anywhere. Where we want to > return several pieces of data, we generally return a Struct array or scalar > wrapping them. > Perhaps we should simply deprecate or even remove them. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-2075) [Python] Add documentation section for integrations with PyTorch, TensorFlow
[ https://issues.apache.org/jira/browse/ARROW-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-2075: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [Python] Add documentation section for integrations with PyTorch, TensorFlow > > > Key: ARROW-2075 > URL: https://issues.apache.org/jira/browse/ARROW-2075 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Python >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > > This should also include any advice about issues with TensorFlow such as > packaging issues. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15139) [Python] write_dataset's file_write_options are too confusing and/or undocumented
Weston Pace created ARROW-15139: --- Summary: [Python] write_dataset's file_write_options are too confusing and/or undocumented Key: ARROW-15139 URL: https://issues.apache.org/jira/browse/ARROW-15139 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Weston Pace ParquetFileWriteOptions / IpcFileWriteOptions / CsvFileWriteOptions are completely undocumented. Also, creating them via the format is a little peculiar / confusing and I'm not sure why it is required. It should be clear to the user what options they can set and how to set them. -- This message was sent by Atlassian Jira (v8.20.1#820001)
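The pattern the ticket finds confusing — write options created via the format object rather than constructed directly — works roughly like the sketch below. The `make_write_options` name mirrors the real pyarrow dataset API; the class body and the specific option names are illustrative assumptions:

```python
class ParquetFormat:
    """Toy stand-in for a file format that knows its own write options,
    which is one rationale for routing option creation through the format."""
    default_options = {"compression": "snappy", "version": "2.6"}

    def make_write_options(self, **overrides):
        unknown = set(overrides) - set(self.default_options)
        if unknown:
            raise TypeError(f"unknown write options: {sorted(unknown)}")
        return {**self.default_options, **overrides}
```

Routing creation through the format lets each format validate its own keys and supply its own defaults — but, as the ticket notes, none of that is visible to the user without documentation.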
[jira] [Commented] (ARROW-15060) [R] open_dataset() on csv files lacks support for compressed files
[ https://issues.apache.org/jira/browse/ARROW-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460947#comment-17460947 ] Carl Boettiger commented on ARROW-15060: My apologies, but it looks like it is read_csv_arrow over S3 which fails: {code:java} ## arrow network csv.gz bench::bench_time({ # ERROR analysis <- arrow::s3_bucket(bucket = "analysis", endpoint_override = "data.ecoforecast.org", anonymous = TRUE) ds <- arrow::read_csv_arrow(analysis$OpenInputFile("combined_forecasts_scores.csv.gz")) }) {code} The above example is on a public bucket, should be reproducible. No problems if I use `read_csv_arrow()` locally. Relatedly, does `write_csv_arrow()` gzip a csv if we add the ".gz" to the name? Or is there a way to toggle compression when using `write_dataset` with `format="csv"` ? > [R] open_dataset() on csv files lacks support for compressed files > -- > > Key: ARROW-15060 > URL: https://issues.apache.org/jira/browse/ARROW-15060 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Carl Boettiger >Priority: Major > > Using open_dataset() on S3 buckets of csv files is a game-changing magic, > particularly with all the additional support for database / dplyr operations > over the remote connection, and the widespread adoption of S3 buckets even by > old-school big data providers like NOAA. > > It's not uncommon to encounter buckets with *.csv.gz formats. I know > technically this should be unnecessary, as compression can be done "in > flight" by the server, but usually this is not an issue for R users since R's > `connection` class automatically detects and gunzips compressed files (over > either POSIX or HTTP connections). It would be really great if arrow could > handle this case too. -- This message was sent by Atlassian Jira (v8.20.1#820001)
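The behaviour the report asks for — treating `*.csv.gz` transparently — usually starts with inferring the codec from the file extension, which is how R's `connection` class decides to gunzip. A minimal sketch of that detection step (the extension map is an assumption, not arrow's actual logic):

```python
def detect_compression(filename):
    """Infer a compression codec from the file extension, or None."""
    suffix_map = {".gz": "gzip", ".zst": "zstd", ".bz2": "bz2"}
    for suffix, codec in suffix_map.items():
        if filename.endswith(suffix):
            return codec
    return None
```

A reader front-end could then wrap the raw input stream in the matching decompressor before handing it to the CSV parser, for local files and S3 objects alike.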
[jira] [Updated] (ARROW-10142) [C++] RecordBatchStreamReader should use StreamDecoder
[ https://issues.apache.org/jira/browse/ARROW-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-10142: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] RecordBatchStreamReader should use StreamDecoder > -- > > Key: ARROW-10142 > URL: https://issues.apache.org/jira/browse/ARROW-10142 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 8.0.0 > > > There's no reason to duplicate some of the stream reading logic, and re-using > StreamDecoder would ensure the behaviour of both classes matches. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13554) [C++] Remove deprecated Scanner::Scan
[ https://issues.apache.org/jira/browse/ARROW-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460940#comment-17460940 ] Antoine Pitrou commented on ARROW-13554: [~westonpace] Do you think this can be done soon? > [C++] Remove deprecated Scanner::Scan > - > > Key: ARROW-13554 > URL: https://issues.apache.org/jira/browse/ARROW-13554 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Assignee: Weston Pace >Priority: Minor > Fix For: 7.0.0 > > > It was already slated for removal in 5.0.0. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-9441) [C++] Optimize RecordBatchReader::ReadAll
[ https://issues.apache.org/jira/browse/ARROW-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-9441: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Optimize RecordBatchReader::ReadAll > - > > Key: ARROW-9441 > URL: https://issues.apache.org/jira/browse/ARROW-9441 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Ji Liu >Priority: Major > Fix For: 8.0.0 > > > Based on perf reports, more time is spent manipulating C++ data structures > than reconstructing record batches from IPC messages, which strikes me as not > what we want > here is from a perf report based on the Python code > {code} > for i in range(100): > pa.ipc.open_stream('nyctaxi.arrow').read_all() > {code} > {code} > - 50.40% 0.06% python libarrow.so.100.0.0 > [.] arrow::RecordBatchReader::ReadAll >- 50.34% arrow::RecordBatchReader::ReadAll > - 25.86% arrow::Table::FromRecordBatches > - 18.41% arrow::SimpleRecordBatch::column > - 16.00% arrow::MakeArray >- 10.49% > arrow::VisitTypeInline > 7.71% arrow::PrimitiveArray::SetData > 1.87% arrow::StringArray::StringArray >1.54% __pthread_mutex_lock >0.88% __pthread_mutex_unlock >0.67% std::_Hash_bytes >0.60% arrow::ChunkedArray::ChunkedArray > - 22.30% arrow::RecordBatchReader::ReadAll > - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext > - 15.91% arrow::ipc::ReadRecordBatchInternal >- 15.15% arrow::ipc::LoadRecordBatch > - 14.45% arrow::ipc::ArrayLoader::Load > + 13.15% arrow::VisitTypeInline > + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage > 1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch > {code} > Perhaps {{ChunkedArray}} internally should be changed to contain a vector of > {{ArrayData}} instead of boxed Arrays. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-8936) [C++] Parallelize execution of arrow::compute::ScalarFunction
[ https://issues.apache.org/jira/browse/ARROW-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8936: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Parallelize execution of arrow::compute::ScalarFunction > - > > Key: ARROW-8936 > URL: https://issues.apache.org/jira/browse/ARROW-8936 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-9441) [C++] Optimize IPC stream reading
[ https://issues.apache.org/jira/browse/ARROW-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-9441: -- Summary: [C++] Optimize IPC stream reading (was: [C++] Optimize RecordBatchReader::ReadAll) > [C++] Optimize IPC stream reading > - > > Key: ARROW-9441 > URL: https://issues.apache.org/jira/browse/ARROW-9441 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Ji Liu >Priority: Major > Fix For: 8.0.0 > > > Based on perf reports, more time is spent manipulating C++ data structures > than reconstructing record batches from IPC messages, which strikes me as not > what we want > here is from a perf report based on the Python code > {code} > for i in range(100): > pa.ipc.open_stream('nyctaxi.arrow').read_all() > {code} > {code} > - 50.40% 0.06% python libarrow.so.100.0.0 > [.] arrow::RecordBatchReader::ReadAll >- 50.34% arrow::RecordBatchReader::ReadAll > - 25.86% arrow::Table::FromRecordBatches > - 18.41% arrow::SimpleRecordBatch::column > - 16.00% arrow::MakeArray >- 10.49% > arrow::VisitTypeInline > 7.71% arrow::PrimitiveArray::SetData > 1.87% arrow::StringArray::StringArray >1.54% __pthread_mutex_lock >0.88% __pthread_mutex_unlock >0.67% std::_Hash_bytes >0.60% arrow::ChunkedArray::ChunkedArray > - 22.30% arrow::RecordBatchReader::ReadAll > - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext > - 15.91% arrow::ipc::ReadRecordBatchInternal >- 15.15% arrow::ipc::LoadRecordBatch > - 14.45% arrow::ipc::ArrayLoader::Load > + 13.15% arrow::VisitTypeInline > + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage > 1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch > {code} > Perhaps {{ChunkedArray}} internally should be changed to contain a vector of > {{ArrayData}} instead of boxed Arrays. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-8936) [C++] Parallelize execution of arrow::compute::ScalarFunction
[ https://issues.apache.org/jira/browse/ARROW-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8936: -- Issue Type: Improvement (was: New Feature) > [C++] Parallelize execution of arrow::compute::ScalarFunction > - > > Key: ARROW-8936 > URL: https://issues.apache.org/jira/browse/ARROW-8936 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information
[ https://issues.apache.org/jira/browse/ARROW-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460937#comment-17460937 ] David Li commented on ARROW-15138: -- ExecPlan already does a topological sort of the nodes so you could probably infer indentation there too. (I thought there was an issue about getting rid of multiple sinks? Though I am playing with a node that has two outputs so I can send data both to a Flight stream and a file writer.) > [C++] Make ExecPlan::ToString give some additional information > -- > > Key: ARROW-15138 > URL: https://issues.apache.org/jira/browse/ARROW-15138 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Priority: Minor > Labels: compute, engine > > Currently we have this task https://issues.apache.org/jira/browse/ARROW-14233 > However, that task is very open about the scope. So this ticket will try to > address some improvements for ExecPlan::ToString without the need to have > additional sophistication nor to extend the scope of features for > ExecPlan::ToString. > This means that we won't need to research about GraphViz (or other backends) > here. We will just need to work to have these features for ExecPlan::ToString: > * Print index/Id for each node in the physical tree > * Make sure all the nodes have names and correct index/IDs > * Apply indentation for the printing process (this may be an optional > setting in the API) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information
[ https://issues.apache.org/jira/browse/ARROW-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460934#comment-17460934 ] Weston Pace commented on ARROW-15138: - I guess for indentation you could do Dijkstra's shortest path from the sink (all costs are 1, there is always only one sink I'm pretty sure) and then base the indentation on that (highest cost is least indented) but I don't know how visually appealing that would be. > [C++] Make ExecPlan::ToString give some additional information > -- > > Key: ARROW-15138 > URL: https://issues.apache.org/jira/browse/ARROW-15138 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Priority: Minor > Labels: compute, engine > > Currently we have this task https://issues.apache.org/jira/browse/ARROW-14233 > However, that task is very open about the scope. So this ticket will try to > address some improvements for ExecPlan::ToString without the need to have > additional sophistication nor to extend the scope of features for > ExecPlan::ToString. > This means that we won't need to research about GraphViz (or other backends) > here. We will just need to work to have these features for ExecPlan::ToString: > * Print index/Id for each node in the physical tree > * Make sure all the nodes have names and correct index/IDs > * Apply indentation for the printing process (this may be an optional > setting in the API) -- This message was sent by Atlassian Jira (v8.20.1#820001)
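Since every edge cost is 1, the shortest-path-from-the-sink idea reduces to a plain BFS over reversed edges. A pure-Python sketch with a hypothetical edge-list representation of the plan (highest cost, i.e. the sources farthest from the sink, gets the least indentation):

```python
from collections import deque

def indent_levels(edges, sink):
    """edges: (input_node, output_node) pairs. BFS from the sink over
    reversed edges gives each node's distance to the sink; indentation
    is then deepest-distance minus distance, so sources sit flush left."""
    inputs = {}
    for src, dst in edges:
        inputs.setdefault(dst, []).append(src)
    dist = {sink: 0}
    queue = deque([sink])
    while queue:
        node = queue.popleft()
        for upstream in inputs.get(node, []):
            if upstream not in dist:
                dist[upstream] = dist[node] + 1
                queue.append(upstream)
    deepest = max(dist.values())
    return {node: deepest - d for node, d in dist.items()}
```

For a linear scan → filter → project → sink plan this yields indentation 0 for the scan and 3 for the sink, matching the "highest cost is least indented" rule.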
[jira] [Commented] (ARROW-14444) [C++] Implement task-based model into the executable-pipelines.
[ https://issues.apache.org/jira/browse/ARROW-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460933#comment-17460933 ] Antoine Pitrou commented on ARROW-14444: [~westonpace] Is this something that needs to be done, or a suggestion? > [C++] Implement task-based model into the executable-pipelines. > --- > > Key: ARROW-14444 > URL: https://issues.apache.org/jira/browse/ARROW-14444 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 7.0.0 > > > The plan fragments will use a task-based model for parallel processing. > That means that there is a scheduler that gets assigned tasks for execution, > keeps selecting the next task for execution, and assigning it to a specific > executor (aka arrow thread pool). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14444) [C++] Implement task-based model into the executable-pipelines.
[ https://issues.apache.org/jira/browse/ARROW-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-14444: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement task-based model into the executable-pipelines. > --- > > Key: ARROW-14444 > URL: https://issues.apache.org/jira/browse/ARROW-14444 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Alexander Ocsa >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > The plan fragments will use a task-based model for parallel processing. > That means that there is a scheduler that gets assigned tasks for execution, > keeps selecting the next task for execution, and assigning it to a specific > executor (aka arrow thread pool). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information
[ https://issues.apache.org/jira/browse/ARROW-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460932#comment-17460932 ] Weston Pace commented on ARROW-15138: - Some of this stuff we may already do in the current ToString. We have a "label" which is the "id" and, if the caller doesn't assign it, we already assign it based on a plan scoped counter (look for auto_label_counter_ in exec_plan.cc). So that handles the first bullet. I'm pretty sure most of the nodes have the correct name. For plans that aren't trees we should print the node the first time we come across it and then, on future times we will come across it, print some kind of reference to the node based on the id/label. For example, we could just throw a # in front of the label (#7). > [C++] Make ExecPlan::ToString give some additional information > -- > > Key: ARROW-15138 > URL: https://issues.apache.org/jira/browse/ARROW-15138 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Percy Camilo Triveño Aucahuasi >Priority: Minor > Labels: compute, engine > > Currently we have this task https://issues.apache.org/jira/browse/ARROW-14233 > However, that task is very open about the scope. So this ticket will try to > address some improvements for ExecPlan::ToString without the need to have > additional sophistication nor to extend the scope of features for > ExecPlan::ToString. > This means that we won't need to research about GraphViz (or other backends) > here. We will just need to work to have these features for ExecPlan::ToString: > * Print index/Id for each node in the physical tree > * Make sure all the nodes have names and correct index/IDs > * Apply indentation for the printing process (this may be an optional > setting in the API) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13310) [C++] Implement hash_aggregate mode kernel
[ https://issues.apache.org/jira/browse/ARROW-13310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13310: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement hash_aggregate mode kernel > -- > > Key: ARROW-13310 > URL: https://issues.apache.org/jira/browse/ARROW-13310 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > Labels: kernel > Fix For: 8.0.0 > > > Requires ARROW-12759. > We have a scalar aggregate kernel for this already and hopefully the > implementation can be reused. Note, Pandas actually doesn't expose this in > DataFrameGroupBy. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-7798) [R] Refactor R <-> Array conversion
[ https://issues.apache.org/jira/browse/ARROW-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-7798: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Refactor R <-> Array conversion > --- > > Key: ARROW-7798 > URL: https://issues.apache.org/jira/browse/ARROW-7798 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Francois Saint-Jacques >Assignee: Romain Francois >Priority: Major > Fix For: 8.0.0 > > > There's a bit of technical debt accumulated in array_to_vector and > vector_to_array: > * Mix of conversion *and* casting, ideally we'd move casting out of there > (at the cost of more memory copy). The rationale is that the conversion logic > will differ from the CastKernels, e.g. when to raise errors, benefits from > complex conversions like timezone... The current implementation is fast, e.g. > it fuses the conversion and casting in a single loop at the cost of code > clarity and divergence. > * There should be 2 paths, zero-copy, non zero-copy. The non-zero copy > should use the newly introduced VectorToArrayConverter which will work with > complex nested types. > * The in array_to vector, Converter should work primarily with Array and not > ArrayVector > * The vector_to_array should not use builders, sizes are known, the null > bitmap should be constructed separately. There's probably a chance that we > can re-use R's memory with zero-copy for the raw data. > * There seem to be multiple paths that do the same conversion: > [https://github.com/apache/arrow/pull/7514#discussion_r446706140] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13339) [C++] Implement hash_aggregate kernels (umbrella issue)
[ https://issues.apache.org/jira/browse/ARROW-13339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13339: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement hash_aggregate kernels (umbrella issue) > --- > > Key: ARROW-13339 > URL: https://issues.apache.org/jira/browse/ARROW-13339 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > Labels: kernel > Fix For: 8.0.0 > > > Umbrella issue to track implementation of hash aggregate kernels. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13309) [C++] Implement hash_aggregate exact quantile kernel
[ https://issues.apache.org/jira/browse/ARROW-13309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13309: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement hash_aggregate exact quantile kernel > > > Key: ARROW-13309 > URL: https://issues.apache.org/jira/browse/ARROW-13309 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Priority: Major > Labels: kernel > Fix For: 8.0.0 > > > Requires ARROW-12759. > We have a scalar aggregate kernel for this already and hopefully the > implementation can be reused. > Also create a pair of kernels {{median()}} and {{hash_median()}} that wrap > this for convenient use by bindings (like in ARROW-14052). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-8470) [Python][R] Expose incremental write API for Feather files
[ https://issues.apache.org/jira/browse/ARROW-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-8470: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [Python][R] Expose incremental write API for Feather files > -- > > Key: ARROW-8470 > URL: https://issues.apache.org/jira/browse/ARROW-8470 > Project: Apache Arrow > Issue Type: Improvement > Components: Python, R >Reporter: Wes McKinney >Priority: Minor > Fix For: 8.0.0 > > > This is already available for writing IPC files, so this would mostly be an > interface to that with the addition of logic to handle conversions from > Python or R data frames and splitting the inputs based on the configured > Feather chunksize -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-9235) [R] Support for `connection` class when reading and writing files
[ https://issues.apache.org/jira/browse/ARROW-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-9235: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Support for `connection` class when reading and writing files > - > > Key: ARROW-9235 > URL: https://issues.apache.org/jira/browse/ARROW-9235 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Michael Quinn >Priority: Major > Fix For: 8.0.0 > > > We have an internal filesystem that we interact with through objects that > inherit from the connection class. These files aren't necessarily local, > making it slightly more complicated to read and write parquet files, for > example. > For now, we're generating raw vectors and using them to create the file. For > example, to read files > {noformat} > ReadParquet <- function(filename, ...) { > file <- file(filename, "rb") > on.exit(close(file)) > raw <- readBin(file, "raw", FileInfo(filename)$size) > return(arrow::read_parquet(raw, ...)) > } > {noformat} > And to write, > {noformat} > WriteParquet <- function(df, filepath, ...) { > stream <- BufferOutputStream$create() > write_parquet(df, stream, ...) > raw <- stream$finish()$data() > file <- file(filepath, "wb") > on.exit(close(file)) > writeBin(raw, file) > return(invisible()) > } > {noformat} > At the C++ level, we are interacting with `R_new_custom_connection` defined > here: > [https://github.com/wch/r-source/blob/trunk/src/include/R_ext/Connections.h] > I've been very impressed with how feature-rich arrow is. It would be nice to > see this API supported as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13530) [C++] Implement cumulative sum compute function
[ https://issues.apache.org/jira/browse/ARROW-13530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-13530: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement cumulative sum compute function > --- > > Key: ARROW-13530 > URL: https://issues.apache.org/jira/browse/ARROW-13530 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Labels: good-first-issue, good-second-issue, kernel > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-9293) [R] Add chunk_size to Table$create()
[ https://issues.apache.org/jira/browse/ARROW-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-9293: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Add chunk_size to Table$create() > > > Key: ARROW-9293 > URL: https://issues.apache.org/jira/browse/ARROW-9293 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Romain Francois >Priority: Major > Fix For: 8.0.0 > > > While working on ARROW-3308, I noticed that write_feather has a chunk_size > argument, which by default will write batches of 64k rows into the file. In > principle, a chunking strategy like this would prevent the need to bump up to > large_utf8 when ingesting a large character vector because you'd end up with > many chunks that each fit into a regular utf8 type. However, the way the > function works, the data.frame is converted to a Table with all ChunkedArrays > containing a single chunk first, which is where the large_utf8 type gets set. > But if Table$create() could be instructed to make multiple chunks, this would > be resolved. -- This message was sent by Atlassian Jira (v8.20.1#820001)
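The chunking strategy the ticket describes, splitting a long vector into fixed-size pieces so that each chunk fits a regular utf8 array, can be sketched in plain Python (an illustrative stand-in, not the arrow R implementation):

```python
def chunked(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]

# A Table built from these pieces would hold one ChunkedArray with
# len(chunks) chunks instead of a single oversized array.
chunks = chunked(list(range(10)), 4)
```

With a `chunk_size` argument on `Table$create()`, each chunk would stay under the 2 GB offset limit of the regular utf8 type, avoiding the silent promotion to large_utf8.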
[jira] [Updated] (ARROW-11071) [R][CI] Use processx to set up minio and flight servers in tests
[ https://issues.apache.org/jira/browse/ARROW-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11071: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R][CI] Use processx to set up minio and flight servers in tests > > > Key: ARROW-11071 > URL: https://issues.apache.org/jira/browse/ARROW-11071 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Minor > Fix For: 8.0.0 > > > Rather than rely on them being set up outside of the tests. processx is > already a transitive test dependency (testthat uses it) so there's no reason > for us not to. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-1888) [C++] Implement casts from one struct type to another (with same field names and number of fields)
[ https://issues.apache.org/jira/browse/ARROW-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-1888: - Assignee: (was: Fernando Rodriguez) > [C++] Implement casts from one struct type to another (with same field names > and number of fields) > -- > > Key: ARROW-1888 > URL: https://issues.apache.org/jira/browse/ARROW-1888 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: analytics, good-first-issue, kernel > Fix For: 7.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-11441) [R] Read CSV from character vector
[ https://issues.apache.org/jira/browse/ARROW-11441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11441: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Read CSV from character vector > -- > > Key: ARROW-11441 > URL: https://issues.apache.org/jira/browse/ARROW-11441 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Minor > Fix For: 8.0.0 > > > `readr::read_csv()` lets you read in data from a character vector, useful for > (e.g.) taking the results of a system call and reading it in as a data.frame. > {code} > > readr::read_csv(c("a,b", "1,2", "3,4")) > # A tibble: 2 x 2 > a b > > 1 1 2 > 2 3 4 > {code} > One solution would be similar to ARROW-9235, perhaps, treating it as a > textConnection. > Another solution is to write to a tempfile. -- This message was sent by Atlassian Jira (v8.20.1#820001)
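For comparison, Python's standard library already reads CSV from an in-memory character vector by wrapping it in a text stream, the same idea as the `textConnection` approach suggested above (a sketch of the concept, not arrow code):

```python
import csv
import io

def read_csv_from_lines(lines):
    """Parse CSV rows from a list of strings; the first row is the header."""
    return list(csv.DictReader(io.StringIO("\n".join(lines))))

# Mirrors readr::read_csv(c("a,b", "1,2", "3,4"))
rows = read_csv_from_lines(["a,b", "1,2", "3,4"])
```

No tempfile is needed because the parser only requires a stream interface, which is what wrapping the character vector provides.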
[jira] [Updated] (ARROW-11699) [R] Implement dplyr::across()
[ https://issues.apache.org/jira/browse/ARROW-11699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11699: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Implement dplyr::across() > - > > Key: ARROW-11699 > URL: https://issues.apache.org/jira/browse/ARROW-11699 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > > It's not a generic, but because it seems only to be called inside of > functions like `mutate()`, we can insert our own version of it into the NSE > data mask -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-11755) [R] Add tests from dplyr/test-mutate.r
[ https://issues.apache.org/jira/browse/ARROW-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11755: --- Fix Version/s: (was: 7.0.0) > [R] Add tests from dplyr/test-mutate.r > -- > > Key: ARROW-11755 > URL: https://issues.apache.org/jira/browse/ARROW-11755 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Minor > Labels: pull-request-available > Time Spent: 7h 10m > Remaining Estimate: 0h > > Review > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r > and port tests over to arrow as needed to see if there are edge cases we > aren't covering appropriately. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-11841) [R][C++] Allow cancelling long-running commands
[ https://issues.apache.org/jira/browse/ARROW-11841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-11841: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R][C++] Allow cancelling long-running commands > --- > > Key: ARROW-11841 > URL: https://issues.apache.org/jira/browse/ARROW-11841 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > When calling a long-running task (for example reading a CSV file) from the R > prompt, users may want to interrupt with Ctrl-C. > Allowing this will require integrating R's user interruption facility with > the cancellation API that's going to be exposed in C++ (see ARROW-8732). > Below some information I've gathered on the topic: > There is some hairy discussion of how to interrupt C++ code from R at > https://stackoverflow.com/questions/40563522/r-how-to-write-interruptible-c-function-and-recover-partial-results > and https://stat.ethz.ch/pipermail/r-devel/2011-April/060714.html . > It seems it may involve polling cpp11::check_user_interrupt() and catching > any cpp11::unwind_exception that may signal an interruption. A complication > is that apparently R APIs should only be called from the main thread. There's > also a small library which claims to make writing all this easier: > https://github.com/tnagler/RcppThread/blob/master/inst/include/RcppThread/RMonitor.hpp > But since user interruptions will only be noticed by the R main thread, the > solution may be to launch heavy computations (e.g. CSV reading) in a separate > thread and have the main R thread periodically poll for interrupts while > waiting for the separate thread. This is what this dedicated thread class > does in its join method: > https://github.com/tnagler/RcppThread/blob/master/inst/include/RcppThread/Thread.hpp#L79 -- This message was sent by Atlassian Jira (v8.20.1#820001)
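The pattern described above, running the heavy computation on a worker thread while the main thread periodically polls for user interrupts between short timed joins, can be sketched as follows (function names are hypothetical illustrations, not the R or C++ API):

```python
import threading
import time

def run_with_interrupt_polling(target, poll_interrupt, interval=0.01):
    """Run target on a worker thread; the calling (main) thread polls for
    user interrupts between short, timed joins."""
    worker = threading.Thread(target=target)
    worker.start()
    while worker.is_alive():
        worker.join(timeout=interval)  # wait briefly, then yield to the poller
        poll_interrupt()               # stands in for cpp11::check_user_interrupt()
    worker.join()

# Demo: a "heavy computation" that takes long enough to be polled at least once.
result = []
polls = []
run_with_interrupt_polling(
    target=lambda: (time.sleep(0.05), result.append("done")),
    poll_interrupt=lambda: polls.append(1),
)
```

This respects the constraint that R APIs may only be called from the main thread: only the poller runs there, while the computation proceeds elsewhere.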
[jira] [Updated] (ARROW-12059) [R] Accept format-specific scan options in collect()
[ https://issues.apache.org/jira/browse/ARROW-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12059: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Accept format-specific scan options in collect() > > > Key: ARROW-12059 > URL: https://issues.apache.org/jira/browse/ARROW-12059 > Project: Apache Arrow > Issue Type: Task > Components: R >Affects Versions: 4.0.0 >Reporter: David Li >Priority: Major > Labels: dataset, datasets > Fix For: 8.0.0 > > > ARROW-9749 and ARROW-8631 added format/scan-specific options. In R, the most > natural place to accept these is in collect(), but this isn't yet done. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12105) [R] Replace vars_select, vars_rename with eval_select, eval_rename
[ https://issues.apache.org/jira/browse/ARROW-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12105: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Replace vars_select, vars_rename with eval_select, eval_rename > -- > > Key: ARROW-12105 > URL: https://issues.apache.org/jira/browse/ARROW-12105 > Project: Apache Arrow > Issue Type: Task > Components: R >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Fix For: 8.0.0 > > > The tidyselect {{vars_select}} and {{vars_rename}} functions are on their way > to being deprecated. Switch to using the newer {{eval_select}} and > {{eval_rename}} functions. This is required to use the {{where()}} selection > helper (ARROW-12778). Unfortunately, this is not a trivial task. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12137) [R] New/improved vignette on dplyr features
[ https://issues.apache.org/jira/browse/ARROW-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12137: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] New/improved vignette on dplyr features > --- > > Key: ARROW-12137 > URL: https://issues.apache.org/jira/browse/ARROW-12137 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15117) [Docs] Splitting the sphinx-based Arrow docs into separate sphinx projects
[ https://issues.apache.org/jira/browse/ARROW-15117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15117: --- Labels: pull-request-available (was: ) > [Docs] Splitting the sphinx-based Arrow docs into separate sphinx projects > -- > > Key: ARROW-15117 > URL: https://issues.apache.org/jira/browse/ARROW-15117 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > See the mailing list > (https://mail-archives.apache.org/mod_mbox/arrow-dev/202112.mbox/%3CCALQtMBbiasQtXYc46kpw-TyQ-TQSPjNQ5%2BkoREuKvJ3hJSdWjw%40mail.gmail.com%3E) > and this google doc > (https://docs.google.com/document/d/1AXDNwU5CSnZ1cSeUISwy_xgvTzoYWeuqWApC8UEv97Q/edit?usp=sharing) > for more context. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12213) [R] copy_files doesn't make it easy to copy a single file
[ https://issues.apache.org/jira/browse/ARROW-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12213: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] copy_files doesn't make it easy to copy a single file > - > > Key: ARROW-12213 > URL: https://issues.apache.org/jira/browse/ARROW-12213 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, R >Reporter: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > > copy_files (i.e. fs::CopyFiles) makes it trivial to recursively copy a > directory/bucket to or from S3, but I'm having a hard time downloading a > single file. > cc [~bkietz] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12778) [R] Support tidyselect where() selection helper in dplyr verbs
[ https://issues.apache.org/jira/browse/ARROW-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12778: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Support tidyselect where() selection helper in dplyr verbs > -- > > Key: ARROW-12778 > URL: https://issues.apache.org/jira/browse/ARROW-12778 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > Fix For: 8.0.0 > > > Since we can now determine the data type of an unevaluated array expression > (ARROW-12291) I think we should be able to support the {{where()}} selection > helper. > This is already done for the {{relocate()}} verb (in ARROW-12781 ) but not > for any other verbs. > Steps required to do this: > # ARROW-12781 > # ARROW-12105 > # Remove the {{check_select_helpers()}} function definition and remove all > the calls to it > # Modify any remaining the {{expect_error()}} tests that test {{where()}} > and check for the error message {{"Unsupported selection helper"}} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-12711) [R] Bindings for paste(collapse), str_c(collapse), and str_flatten()
[ https://issues.apache.org/jira/browse/ARROW-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12711: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Bindings for paste(collapse), str_c(collapse), and str_flatten() > > > Key: ARROW-12711 > URL: https://issues.apache.org/jira/browse/ARROW-12711 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Ian Cook >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > These are the aggregating versions of string concatenation: they combine > values from a set of rows into a single value. > The bindings for {{paste()}} and {{str_c()}} might be tricky to implement > because when these functions are called with the {{collapse}} argument > unset, they do _not_ aggregate. > In {{summarise()}} we need to be able to use scalar concatenation within > aggregate concatenation, like this: > {code:java} > starwars %>% > filter(!is.na(hair_color) & !is.na(eye_color)) %>% > group_by(homeworld) %>% > summarise(hair_and_eyes = paste0(paste0(hair_color, "-haired and ", > eye_color, "-eyed"), collapse = ", ")){code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
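The scalar-then-aggregate pattern in the query above (elementwise paste0 inside a grouped collapse) amounts to the following, sketched in plain Python rather than R to make the two concatenation modes explicit:

```python
def collapse_by_group(keys, values, sep=", "):
    """Aggregate string concatenation: join each group's values into one string."""
    groups = {}
    for key, value in zip(keys, values):
        groups.setdefault(key, []).append(value)
    return {key: sep.join(vals) for key, vals in groups.items()}

homeworld = ["Tatooine", "Tatooine", "Naboo"]
traits = [("blond", "blue"), ("brown", "brown"), ("none", "orange")]

# Scalar (elementwise) concatenation: the inner paste0(), one output per row.
described = [f"{hair}-haired and {eye}-eyed" for hair, eye in traits]

# Aggregate concatenation with collapse set: the outer call, one output per group.
by_world = collapse_by_group(homeworld, described)
```

The tricky part the ticket notes is that the same R function does both jobs, switching from the rowwise form to the grouped form only when `collapse` is set.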
[jira] [Updated] (ARROW-8626) [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool
[ https://issues.apache.org/jira/browse/ARROW-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8626: -- Fix Version/s: 8.0.0 (was: 7.0.0) > [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool > --- > > Key: ARROW-8626 > URL: https://issues.apache.org/jira/browse/ARROW-8626 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Weston Pace >Priority: Major > Fix For: 8.0.0 > > > Currently, when submitting tasks to a thread pool, they are all commingled in > a common queue. When a new task submitter shows up, they must wait in the > back of the line behind all other queued tasks. > A simple alternative to this would be round-robin scheduling, where each new > consumer is assigned a unique integer id, and the schedule / thread pool > internally maintains the tasks associated with the consumer in separate > queues. -- This message was sent by Atlassian Jira (v8.20.1#820001)
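The round-robin idea above, per-consumer queues drained in rotation so that a new submitter is not stuck behind every previously queued task, can be sketched as follows (an illustrative model only, not the Arrow C++ ThreadPool API):

```python
from collections import deque

class RoundRobinScheduler:
    """Per-consumer task queues drained one task at a time, in rotation."""

    def __init__(self):
        self.queues = {}      # consumer id -> deque of tasks
        self.order = deque()  # rotation order of consumer ids

    def submit(self, consumer_id, task):
        if consumer_id not in self.queues:
            self.queues[consumer_id] = deque()
            self.order.append(consumer_id)
        self.queues[consumer_id].append(task)

    def next_task(self):
        # Visit each consumer at most once per call; skip empty queues.
        for _ in range(len(self.order)):
            consumer_id = self.order[0]
            self.order.rotate(-1)
            if self.queues[consumer_id]:
                return self.queues[consumer_id].popleft()
        return None

sched = RoundRobinScheduler()
for task in ["a1", "a2", "a3"]:
    sched.submit("A", task)
sched.submit("B", "b1")
drained = [sched.next_task() for _ in range(4)]
```

Even though consumer B arrived after A had three tasks queued, B's first task runs second rather than fourth.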
[jira] [Updated] (ARROW-13118) [R] Improve handling of R scalars in some nse_funcs
[ https://issues.apache.org/jira/browse/ARROW-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13118: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Improve handling of R scalars in some nse_funcs > --- > > Key: ARROW-13118 > URL: https://issues.apache.org/jira/browse/ARROW-13118 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Fix For: 8.0.0 > > > Some of the functions in {{nse_funcs}} do not behave properly when passed R > scalar input in expressions in dplyr verbs. Some examples: > {code:r} > Table$create(x = 1) %>% mutate(as.character(42)) > Table$create(x = 1) %>% mutate(is.character(("foo"))) > Table$create(x = 1) %>% mutate(nchar("foo")) > Table$create(x = 1) %>% mutate(is.infinite(Inf)) > {code} > This could be resolved by using {{build_expr()}} instead of > {{Expression$create()}}, but {{build_expr()}} is somewhat heavy. The only > part of it we really need to make this work is this: > {code:r} > args <- lapply(args, function(x) { > if (!inherits(x, "Expression")) { > x <- Expression$scalar(x) > } > x > }){code} > If {{build_expr()}} is too heavy, we could make a function called > {{wrap_r_scalar}}, like this: > {code:r} > wrap_r_scalar <- function(x) { > if (!inherits(x, "Expression")) { > assert_that( > length(x) == 1, > msg = "Literal vectors of length != 1 not supported" > ) > Expression$scalar(x) > } else { > x > } > } > {code} > and use it as needed in various {{nse_funcs}} functions. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13337) [R] Define Math group generics
[ https://issues.apache.org/jira/browse/ARROW-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13337: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Define Math group generics > -- > > Key: ARROW-13337 > URL: https://issues.apache.org/jira/browse/ARROW-13337 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > > Following ARROW-13280 we now support log, trigonometry, and other functions > that are in the {{Math}} S3 generic, so we should define it like how we've > already done for {{Ops}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13993) [C++] Hash aggregate function that returns value from first row in group
[ https://issues.apache.org/jira/browse/ARROW-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460929#comment-17460929 ] Antoine Pitrou commented on ARROW-13993: Since the result would be non-deterministic, I'm not sure I understand the point of a {{hash_take}} function compared to the {{hash_first}} proposal. > [C++] Hash aggregate function that returns value from first row in group > > > Key: ARROW-13993 > URL: https://issues.apache.org/jira/browse/ARROW-13993 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Priority: Major > Labels: good-second-issue, kernel > Fix For: 7.0.0 > > > It would be nice to have a hash aggregate function that returns the first > value of a column within each hash group. > If row order within groups is non-deterministic, then effectively this would > return one arbitrary value. This is a very computationally cheap operation. > This can be quite useful when querying a non-normalized table. For example if > you have a table with a {{country}} column and also a {{country_abbr}} column > and you want to group by either/both of those columns but return the values > from both columns, you could do > {code:java} > SELECT country, country_abbr FROM table GROUP BY country, country_abbr{code} > but it would be more efficient to do > {code:java} > SELECT country, first(country_abbr) FROM table GROUP BY country{code} > because then the engine does not need to scan all the values of the > {{country_abbr}} column. -- This message was sent by Atlassian Jira (v8.20.1#820001)
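The proposed first-value-per-group semantics reduce to a single pass over the data, which is why the ticket calls it computationally cheap. A plain-Python sketch of the kernel's behavior (illustrative only, not Arrow code):

```python
def hash_first(keys, values):
    """Return the first value encountered for each distinct key."""
    first = {}
    for key, value in zip(keys, values):
        if key not in first:
            first[key] = value
    return first

# Mirrors: SELECT country, first(country_abbr) FROM table GROUP BY country
countries = ["Peru", "Peru", "Chile"]
abbrs = ["PE", "PE", "CL"]
result = hash_first(countries, abbrs)
```

When the functional dependency holds (one abbreviation per country), the arbitrariness of "first" is harmless, which is exactly the denormalized-table use case described above.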
[jira] [Updated] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-14436: --- Priority: Minor (was: Major) > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from https://github.com/apache/arrow/issues/11279 > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. > I can open a PR if you tell me which way you prefer. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13854) [R] More accurately determine output type of an aggregation expression
[ https://issues.apache.org/jira/browse/ARROW-13854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13854: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] More accurately determine output type of an aggregation expression > -- > > Key: ARROW-13854 > URL: https://issues.apache.org/jira/browse/ARROW-13854 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Minor > Fix For: 8.0.0 > > > ARROW-13740 added some basic heuristics, but the C++ library surely knows the > actual types; we just need to find out how to ask it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13767) [R] Add Arrow methods slice(), slice_head(), slice_tail()
[ https://issues.apache.org/jira/browse/ARROW-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13767: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Add Arrow methods slice(), slice_head(), slice_tail() > - > > Key: ARROW-13767 > URL: https://issues.apache.org/jira/browse/ARROW-13767 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > Labels: query-engine > Fix For: 8.0.0 > > > Implement [{{slice()}}, {{slice_head()}}, and > {{slice_tail()}}|https://dplyr.tidyverse.org/reference/slice.html] methods > for {{ArrowTabular}}, {{Dataset}}, and {{arrow_dplyr_query}} objects. I > believe this should be relatively straightforward, using {{Take()}} to return > only the specified rows. We already have a {{head()}} method which I believe > we can reuse for {{slice_head()}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14436) [C++] Disable color diagnostics when compiling with ccache
[ https://issues.apache.org/jira/browse/ARROW-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460927#comment-17460927 ] Antoine Pitrou commented on ARROW-14436: I'm a bit surprised about this. I use ccache and have no problem with color diagnostics. Is it just because of an old version of ccache? > [C++] Disable color diagnostics when compiling with ccache > -- > > Key: ARROW-14436 > URL: https://issues.apache.org/jira/browse/ARROW-14436 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Eduardo Ponce >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Copied from https://github.com/apache/arrow/issues/11279 > Steps to reproduce: > Compile arrow_objlib with ccache, clang and CCACHE_DEBUG=1 > CCACHE_LOGFILE=./ccache.log > Find in ./ccache.log: > Failed; falling back to running the real compiler > Result: unsupported compiler option > Dropping -fcolor-diagnostics fixes the issue. > I suggest either opting into color diagnostics with WITH_COLOR_DIAGNOSTICS or > adding a way to disable it via DISABLE_COLOR_DIAGNOSTICS. > It would be good if this wouldn't be tied to ARROW_USE_CCACHE since its also > relevant for: > -DARROW_USE_CCACHE=OFF -DCMAKE_CXX_COMPILER_LAUNCHER=emscripten_ccache. > I can open a PR if you tell me which way you prefer. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13926) [R] Support mutate/summarize with implicit join
[ https://issues.apache.org/jira/browse/ARROW-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13926: --- Fix Version/s: 8.0.0 (was: 7.0.0) > [R] Support mutate/summarize with implicit join > --- > > Key: ARROW-13926 > URL: https://issues.apache.org/jira/browse/ARROW-13926 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Fix For: 8.0.0 > > > {code} > mtcars %>% > group_by(cyl) %>% > mutate(x = hp - mean(hp)) > {code} > essentially means something like > {code} > mtcars %>% > left_join(mtcars %>% > group_by(cyl) %>% > summarize(tmp = mean(hp)) > ) %>% > mutate(x = hp - tmp) %>% > select(-tmp) > {code} > Apparently you can do the same inside summarize() too (though IDK if that's > behavior we want to encourage). Once we can do joins, we can support these > queries. -- This message was sent by Atlassian Jira (v8.20.1#820001)
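The equivalence sketched in the ticket, a grouped mutate being the same as joining a per-group summary back onto the rows, can be demonstrated in plain Python (an illustration of the rewrite, not arrow code):

```python
def demean_by_group(keys, values):
    """Subtract each key's group mean from its values (a grouped mutate)."""
    sums, counts = {}, {}
    for key, value in zip(keys, values):
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    # The per-group summary that the implicit left join would attach:
    means = {key: sums[key] / counts[key] for key in sums}
    # The mutate step: each row looks up its group's summary value.
    return [value - means[key] for key, value in zip(keys, values)]

cyl = [4, 4, 6]
hp = [90.0, 110.0, 200.0]
x = demean_by_group(cyl, hp)
```

The two-pass structure (summarize, then join back) is exactly what makes join support a prerequisite for these queries.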
[jira] [Commented] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information
[ https://issues.apache.org/jira/browse/ARROW-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460926#comment-17460926 ]

David Li commented on ARROW-15138:
----------------------------------
One thing that's been requested is a way to rebuild an exec plan printed in one language (e.g. so that we can get an ExecPlan from R into C++ for debugging). Doing that fully automatically is out of scope here (though: do we want some way to serialize an ExecPlan to/from an internal format? Or is that redundant with Substrait support?), but we should ensure all necessary data is included. It may also be hard in the general case, because plans are not necessarily trees.

For options like indentation, we could add an (optional) options struct.

Otherwise I agree with the suggested improvements. One more thing to consider is ensuring we generate unique names in case there are duplicate nodes (I think we already do, since we number them?).

> [C++] Make ExecPlan::ToString give some additional information
> --------------------------------------------------------------
>
>                 Key: ARROW-15138
>                 URL: https://issues.apache.org/jira/browse/ARROW-15138
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Percy Camilo Triveño Aucahuasi
>            Priority: Minor
>              Labels: compute, engine
>
> Currently we have this task: https://issues.apache.org/jira/browse/ARROW-14233
> However, that task is very open-ended in scope, so this ticket will address
> some improvements to ExecPlan::ToString without additional sophistication
> and without extending its feature scope. This means we won't need to
> research GraphViz (or other backends) here.
> We just need ExecPlan::ToString to:
> * Print an index/ID for each node in the physical tree
> * Make sure all the nodes have names and correct indexes/IDs
> * Apply indentation for the printing process (this may be an optional
>   setting in the API)
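The "unique names via numbering" idea from the comment above can be sketched with a toy plan structure. This is a hypothetical illustration, not Arrow's actual ExecPlan API: the `Node` struct and `ToString` function below are invented for the example. A depth-first walk carries a running counter, so two nodes of the same kind still print distinct `kind:index` labels, with indentation per level.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical plan node: a kind label plus input nodes (not Arrow's API).
struct Node {
  std::string kind;  // e.g. "scan", "filter", "sink"
  std::vector<std::shared_ptr<Node>> inputs;
};

// Depth-first print; the shared counter guarantees unique indexes even when
// several nodes share the same kind.
void ToStringImpl(const Node& node, int indent, int* counter,
                  std::ostringstream* out) {
  *out << std::string(indent * 2, ' ') << node.kind << ":" << (*counter)++
       << "\n";
  for (const auto& input : node.inputs) {
    ToStringImpl(*input, indent + 1, counter, out);
  }
}

std::string ToString(const Node& root) {
  std::ostringstream out;
  int counter = 0;
  ToStringImpl(root, 0, &counter, &out);
  return out.str();
}
```

A plan with two stacked "filter" nodes over a "scan" would print as `filter:0`, `  filter:1`, `    scan:2` — duplicate kinds, unique labels.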
[jira] [Updated] (ARROW-14028) [R] Cast of NaN to integer should return NA_integer_
[ https://issues.apache.org/jira/browse/ARROW-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14028:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Cast of NaN to integer should return NA_integer_
> ----------------------------------------------------
>
>                 Key: ARROW-14028
>                 URL: https://issues.apache.org/jira/browse/ARROW-14028
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Priority: Major
>             Fix For: 8.0.0
>
> Casting double {{NaN}} to integer returns a sentinel value:
> {code:r}
> call_function("cast", Scalar$create(NaN), options = list(to_type = int32(),
>   allow_float_truncate = TRUE))
> #> Scalar
> #> -2147483648
> call_function("cast", Scalar$create(NaN), options = list(to_type = int64(),
>   allow_float_truncate = TRUE))
> #> Scalar
> #> -9223372036854775808
> {code}
> It would be nice if this instead returned {{NA_integer_}}.
> N.B. for some reason this doesn't reproduce in dplyr unless you round-trip it
> back to double:
> {code:r}
> Table$create(x = NaN) %>% transmute(as.double(as.integer(x))) %>% pull(1)
> #> [1] -2147483648
> {code}
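The behavior the ticket asks for can be sketched outside Arrow's kernel machinery. The standalone function below is a hypothetical illustration (not Arrow's cast implementation): it models a nullable integer column with `std::optional`, checks each double for NaN, and emits a null rather than letting the truncating cast produce the sentinel `-2147483648`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical nullable cast: NaN maps to null (the C++ analogue of
// NA_integer_) instead of falling through to a truncating cast whose
// result for NaN is the int32 sentinel value.
std::vector<std::optional<int32_t>> CastDoubleToInt32(
    const std::vector<double>& values) {
  std::vector<std::optional<int32_t>> out;
  out.reserve(values.size());
  for (double v : values) {
    if (std::isnan(v)) {
      out.push_back(std::nullopt);  // NaN -> null, not -2147483648
    } else {
      out.push_back(static_cast<int32_t>(v));  // ordinary truncating cast
    }
  }
  return out;
}
```

In Arrow itself the null would be recorded in the array's validity bitmap rather than per-element optionals; the optional is just a compact stand-in for the sketch.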
[jira] [Updated] (ARROW-14021) [R] Bind median() and quantile() to exact not approximate median and quantile
[ https://issues.apache.org/jira/browse/ARROW-14021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14021:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Bind median() and quantile() to exact not approximate median and quantile
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-14021
>                 URL: https://issues.apache.org/jira/browse/ARROW-14021
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Priority: Major
>             Fix For: 8.0.0
>
> ARROW-13772 binds {{quantile()}} to {{tdigest()}}, which returns approximate
> quantiles, and binds {{median()}} to {{approximate_median()}}, which returns
> an approximate median. The bindings issue a warning saying that the
> median/quantile is approximate. Once ARROW-13309 is implemented, modify the
> bindings to call Arrow functions that return exact quantiles and medians,
> and remove the warnings.
> We should keep the approximate quantile and median bindings but rename them.
> When doing this, we should also modify the bindings to accept {{type}} and
> {{interpolation}} arguments like we do in the {{quantile.ArrowDatum}} method:
> https://github.com/apache/arrow/blob/170a24fc5f9f6f9c7c7d15a716c5995f03204a4b/r/R/compute.R#L156-L187
[jira] [Updated] (ARROW-13598) [C++] Deprecate Datum::COLLECTION
[ https://issues.apache.org/jira/browse/ARROW-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-13598:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Deprecate Datum::COLLECTION
> ---------------------------------
>
>                 Key: ARROW-13598
>                 URL: https://issues.apache.org/jira/browse/ARROW-13598
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> It looks like "collection" datums are not used anywhere. Where we want to
> return several pieces of data, we generally return a Struct array or scalar
> wrapping them.
> Perhaps we should simply deprecate or even remove them.
[jira] [Updated] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()
[ https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14045:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Support for .keep_all = TRUE with distinct()
> ------------------------------------------------
>
>                 Key: ARROW-14045
>                 URL: https://issues.apache.org/jira/browse/ARROW-14045
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Priority: Major
>             Fix For: 8.0.0
[jira] [Updated] (ARROW-14085) [R] Expose null placement option through sort bindings
[ https://issues.apache.org/jira/browse/ARROW-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14085:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Expose null placement option through sort bindings
> ------------------------------------------------------
>
>                 Key: ARROW-14085
>                 URL: https://issues.apache.org/jira/browse/ARROW-14085
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Assignee: Ian Cook
>            Priority: Major
>              Labels: kernel
>             Fix For: 8.0.0
>
> ARROW-12063 added a null placement option to the sort kernels and to
> {{OrderBySinkNode}} in the C++ library. Expose this through the R bindings.
[jira] [Updated] (ARROW-14071) [R] Try to arrow_eval user-defined functions
[ https://issues.apache.org/jira/browse/ARROW-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14071:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Try to arrow_eval user-defined functions
> --------------------------------------------
>
>                 Key: ARROW-14071
>                 URL: https://issues.apache.org/jira/browse/ARROW-14071
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Dewey Dunnington
>            Priority: Major
>             Fix For: 8.0.0
>
> The first test passes but the second one fails, even though they're
> equivalent. The user's function isn't being evaluated in the nse_funcs
> environment.
> {code}
> expect_dplyr_equal(
>   input %>%
>     select(-fct) %>%
>     filter(nchar(padded_strings) < 10) %>%
>     collect(),
>   tbl
> )
> isShortString <- function(x) nchar(x) < 10
> expect_dplyr_equal(
>   input %>%
>     select(-fct) %>%
>     filter(isShortString(padded_strings)) %>%
>     collect(),
>   tbl
> )
> {code}
[jira] [Updated] (ARROW-14199) [R] bindings for format where possible
[ https://issues.apache.org/jira/browse/ARROW-14199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14199:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] bindings for format where possible
> --------------------------------------
>
>                 Key: ARROW-14199
>                 URL: https://issues.apache.org/jira/browse/ARROW-14199
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>            Reporter: Jonathan Keane
>            Assignee: Dewey Dunnington
>            Priority: Major
>             Fix For: 8.0.0
>
> Now that we have {{strftime}}, we should also be able to make bindings for
> {{format()}}. This might be complicated: {{format()}} accepts a bunch of
> types that arrow doesn't (yet) support formatting, so we may need to punt
> on those, and that's ok.
> Though some of those might be wrappable with a handful of kernels stacked
> together: {{format(float)}} might be round + cast to character.
[jira] [Updated] (ARROW-14159) [R] Re-allow some multithreading on Windows
[ https://issues.apache.org/jira/browse/ARROW-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14159:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Re-allow some multithreading on Windows
> -------------------------------------------
>
>                 Key: ARROW-14159
>                 URL: https://issues.apache.org/jira/browse/ARROW-14159
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>             Fix For: 8.0.0
>
> Followup to ARROW-8379, which set use_threads = FALSE on Windows. See the
> discussion about adding more controls, disabling threading in some places
> and not others, etc. We want to do this soon after release so that we have
> a few months to see how things behave on CI before releasing again.
> ----
> Collecting some CI hangs after ARROW-8379:
> 1. Rtools35, 64-bit test suite hangs:
> https://github.com/apache/arrow/pull/11290/checks?check_run_id=3767787034
> {code}
> ** running tests for arch 'i386' ...
>   Running 'testthat.R' [17s]
>  OK
> ** running tests for arch 'x64' ...
> Error: Error: stderr is not a pipe.
> {code}
[jira] [Updated] (ARROW-14185) [C++] HashJoinNode should validate HashJoinOptions
[ https://issues.apache.org/jira/browse/ARROW-14185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-14185:
-----------------------------------
    Labels: good-first-issue query-engine  (was: query-engine)

> [C++] HashJoinNode should validate HashJoinOptions
> --------------------------------------------------
>
>                 Key: ARROW-14185
>                 URL: https://issues.apache.org/jira/browse/ARROW-14185
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: David Li
>            Priority: Major
>              Labels: good-first-issue, query-engine
>             Fix For: 7.0.0
>
> At the very least, key_cmp needs to match the number of join columns; there
> may be other invariants to uphold. Probably all the ones from the [semi-join
> PR|https://github.com/apache/arrow/pull/10845/files#diff-1028a02ecee91ae8a9fbb7318e1ce650760716d317457d48e0740c802287fbdbR38]:
> there must be the same, non-zero number of keys on both sides.
[jira] [Updated] (ARROW-9842) [C++] Explore alternative strategy for Compare kernel implementation for better performance
[ https://issues.apache.org/jira/browse/ARROW-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-9842:
----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [C++] Explore alternative strategy for Compare kernel implementation for
> better performance
> ------------------------------------------------------------------------
>
>                 Key: ARROW-9842
>                 URL: https://issues.apache.org/jira/browse/ARROW-9842
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 8.0.0
>
>         Attachments: movemask-in-chunks.diff, movemask.patch
>
> The compiler may be able to vectorize comparison operations if the
> bitpacking of results is deferred until the end (or done in chunks).
> Instead, a temporary bytemap can be populated on a chunk-by-chunk basis and
> then the bytemaps can be bitpacked into the output buffer. This may also
> reduce the code size of the compare kernels (which are actually quite large
> at the moment).
[jira] [Updated] (ARROW-14264) [R] Support inequality joins
[ https://issues.apache.org/jira/browse/ARROW-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14264:
-----------------------------------
    Fix Version/s: 8.0.0
                   (was: 7.0.0)

> [R] Support inequality joins
> ----------------------------
>
>                 Key: ARROW-14264
>                 URL: https://issues.apache.org/jira/browse/ARROW-14264
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>              Labels: query-engine
>             Fix For: 8.0.0
>
> We'll need this not-yet-merged dplyr API to do it:
> https://github.com/tidyverse/dplyr/pull/5910
[jira] [Updated] (ARROW-15137) [Dev] Update archery crossbow latest-prefix to work with nightly dates
[ https://issues.apache.org/jira/browse/ARROW-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-15137:
------------------------------------------
    Fix Version/s: 7.0.0

> [Dev] Update archery crossbow latest-prefix to work with nightly dates
> ----------------------------------------------------------------------
>
>                 Key: ARROW-15137
>                 URL: https://issues.apache.org/jira/browse/ARROW-15137
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Developer Tools
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h