[jira] [Created] (ARROW-15144) [Java] Unable to read IPC file in master
Jorge Leitão created ARROW-15144:
------------------------------------

             Summary: [Java] Unable to read IPC file in master
                 Key: ARROW-15144
                 URL: https://issues.apache.org/jira/browse/ARROW-15144
             Project: Apache Arrow
          Issue Type: Bug
          Components: Java
            Reporter: Jorge Leitão
         Attachments: generated_primitive.arrow

I think that PR https://github.com/apache/arrow/pull/11709 may have caused a regression in reading IPC files. Attached is an Arrow file that can't be read by the Java implementation but can be read by all other implementations. Its contents correspond exactly to the generated_primitive.json.gz used in the integration tests.

Background: the integration CI pipeline in Rust's arrow2 started failing after the PR mentioned above. The logs show that all implementations except Java are able to consume the attached file (and, more generally, the files created by arrow2's implementation). The PR broke almost all tests, suggesting that the problem is not specific to this file but a broader issue.

Log: https://pipelines.actions.githubusercontent.com/RJ1isxNgLS0jQX3HKOGkLQjJSEMqOm4RfxnyKHS4o90jAsObvY/_apis/pipelines/1/runs/14655/signedlogcontent/2?urlExpires=2021-12-17T05%3A35%3A25.6055769Z&urlSigningMethod=HMACV1&urlSignature=Nx7nRNdrcUCbtvOnnXAYGDEuSEJUiDT%2BU2jNcqqp%2FEs%3D

The logs also suggest that the Java implementation may be leaking memory when this happens:

{code:java}
2021-12-16T05:38:37.6833847Z 05:38:37.622 [main] ERROR org.apache.arrow.tools.Integration - Incompatible files
2021-12-16T05:38:37.6835533Z java.lang.IllegalArgumentException: Different values in column:
2021-12-16T05:38:37.6836731Z f11: Timestamp(SECOND, UTC) at index 0: null != -62135596800
2021-12-16T05:38:37.6838188Z 	at org.apache.arrow.vector.util.Validator.compareFieldVectors(Validator.java:133)
2021-12-16T05:38:37.6840563Z 	at org.apache.arrow.vector.util.Validator.compareVectorSchemaRoot(Validator.java:107)
2021-12-16T05:38:37.6842476Z 	at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:209)
2021-12-16T05:38:37.6843841Z 	at org.apache.arrow.tools.Integration.run(Integration.java:119)
2021-12-16T05:38:37.6845214Z 	at org.apache.arrow.tools.Integration.main(Integration.java:70)
2021-12-16T05:38:37.6846597Z 	Suppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (894)
2021-12-16T05:38:37.6847623Z Allocator(ROOT) 0/894/442402/2147483647 (res/actual/peak/limit)
2021-12-16T05:38:37.6848029Z
2021-12-16T05:38:37.6848996Z 		at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437)
2021-12-16T05:38:37.6851316Z 		at org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29)
2021-12-16T05:38:37.6882832Z 		at org.apache.arrow.tools.Integration$Command$3.$closeResource(Integration.java:228)
2021-12-16T05:38:37.6884294Z 		at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:228)
2021-12-16T05:38:37.6885249Z 		... 2 common frames omitted
{code}

I can't rule out the possibility that this is an issue in arrow2 relying on unspecified behavior - I am raising it here because all other implementations can read the files.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
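To cross-check the claim above that non-Java implementations read the file fine, a minimal pyarrow sketch (assuming the attached generated_primitive.arrow is saved in the working directory):

{code:python}
import pyarrow as pa

# Open the attached IPC (Feather V2 / Arrow file format) file and read it
# back; other implementations reportedly succeed where Java fails.
reader = pa.ipc.open_file("generated_primitive.arrow")
table = reader.read_all()
print(table.schema)
print(table.num_rows)
{code}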
[jira] [Created] (ARROW-15143) Remove incorrect comment on API of Transform for StringBinaryTransformExecBase
Eduardo Ponce created ARROW-15143:
-------------------------------------

             Summary: Remove incorrect comment on API of Transform for StringBinaryTransformExecBase
                 Key: ARROW-15143
                 URL: https://issues.apache.org/jira/browse/ARROW-15143
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Eduardo Ponce
            Assignee: Eduardo Ponce
             Fix For: 7.0.0

The [comment describing the API of the Transform method for classes that inherit from StringBinaryTransformExecBase|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L626] incorrectly states that there is a last parameter for returning an error status. Instead, the Transform method indicates an error by returning a negative value.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15142) Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow
Karthik created ARROW-15142:
-------------------------------

             Summary: Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow
                 Key: ARROW-15142
                 URL: https://issues.apache.org/jira/browse/ARROW-15142
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 6.0.1
            Reporter: Karthik

When trying to save a Pandas dataframe with a nested type (list within list, list within dict) using the pyarrow engine, the following error is encountered:

{color:#e75c58}ArrowInvalid{color}: ('cannot mix list and non-list, non-null values', 'Conversion failed for column A with type object')

Repro:

{code:python}
import pandas as pd
x = pd.DataFrame({"A": [[24, 27, [1, 1]]]})
x.to_parquet('/tmp/a.pqt', engine="pyarrow")
{code}

Doing a bit of googling, it appears that this is a known Arrow shortcoming. However, this is a commonly encountered data structure, and fastparquet handles it seamlessly. Is there a proposed timeline/plan for fixing this?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
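A sketch narrowing the failure down, on the assumption that the root cause is type inference over a column whose elements mix scalars and lists; a homogeneously nested column converts fine:

{code:python}
import pandas as pd

# Homogeneous nesting: every element of column A is a list of lists of ints,
# so pyarrow can infer list<list<int64>> and the write succeeds.
ok = pd.DataFrame({"A": [[[24], [27], [1, 1]]]})
ok.to_parquet('/tmp/ok.pqt', engine="pyarrow")

# Mixed nesting, as in the repro above: ints and a list inside the same
# outer list, so no single Arrow type can be inferred -> ArrowInvalid.
bad = pd.DataFrame({"A": [[24, 27, [1, 1]]]})
bad.to_parquet('/tmp/bad.pqt', engine="pyarrow")  # raises ArrowInvalid
{code}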
[jira] [Created] (ARROW-15141) Fatal error condition occurred in aws_thread_launch
F. H. created ARROW-15141:
-----------------------------

             Summary: Fatal error condition occurred in aws_thread_launch
                 Key: ARROW-15141
                 URL: https://issues.apache.org/jira/browse/ARROW-15141
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 6.0.1, 6.0.0
         Environment: - `uname -a`: Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- `mamba list | grep -i "pyarrow\|tensorflow\|^python"`:
pyarrow               6.0.0    py39hff6fa39_1_cpu       conda-forge
python                3.9.7    hb7a2778_3_cpython       conda-forge
python-dateutil       2.8.2    pyhd8ed1ab_0             conda-forge
python-flatbuffers    1.12     pyhd8ed1ab_1             conda-forge
python-irodsclient    1.0.0    pyhd8ed1ab_0             conda-forge
python-rocksdb        0.7.0    py39h7fcd5f3_4           conda-forge
python_abi            3.9      2_cp39                   conda-forge
tensorflow            2.6.2    cuda112py39h9333c2f_0    conda-forge
tensorflow-base       2.6.2    cuda112py39h7de589b_0    conda-forge
tensorflow-estimator  2.6.2    cuda112py39h9333c2f_0    conda-forge
tensorflow-gpu        2.6.2    cuda112py39h0bbbad9_0    conda-forge
            Reporter: F. H.

Hi, I am randomly getting the following error when first running inference with a TensorFlow model and then writing the result to a `.parquet` file:

```
Fatal error condition occurred in /home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
Stack trace:
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59) [0x7ffb14235f19]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48) [0x7ffb14227098]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43) [0x7ffb1406ea43]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a) [0x7ffb1406c35a]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d) [0x7ffb14237fad]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a) [0x7ffb142a2f5a]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570) [0x7ffb147fd570]
/lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27]
/lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba]
/home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) [0x562576609a51]
/bin/bash: line 1: 2341494 Aborted (core dumped)
```

My colleague ran into the same issue on CentOS 8 while running the same job in the same environment on SLURM, so I guess it could be some issue between TensorFlow and pyarrow. I also found a GitHub issue with multiple people running into the same problem: https://github.com/huggingface/datasets/issues/3310

It would be very important to my lab that this bug gets resolved, as we cannot work with Parquet anymore. Unfortunately, we do not have the knowledge to fix it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
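A hypothetical minimal trigger pieced together from the report above; the import order and the write call are assumptions, and the crash is reported at interpreter exit rather than at the write itself:

{code:python}
import tensorflow as tf  # loaded before pyarrow, as in the reported job
import pandas as pd

df = pd.DataFrame({"x": range(10)})
df.to_parquet("/tmp/out.parquet", engine="pyarrow")
# The "Fatal error condition ... aws_thread_launch" assertion would then
# fire during process teardown, in the AWS SDK's exit-time cleanup.
{code}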
[jira] [Created] (ARROW-15140) [CI] move to v2 of checkouts for GHA
Jonathan Keane created ARROW-15140:
--------------------------------------

             Summary: [CI] move to v2 of checkouts for GHA
                 Key: ARROW-15140
                 URL: https://issues.apache.org/jira/browse/ARROW-15140
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration
            Reporter: Jonathan Keane
            Assignee: Jonathan Keane

https://github.com/apache/arrow/pull/11974#discussion_r770942784

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15139) [Python] write_dataset's file_write_options are too confusing and/or undocumented
Weston Pace created ARROW-15139:
-----------------------------------

             Summary: [Python] write_dataset's file_write_options are too confusing and/or undocumented
                 Key: ARROW-15139
                 URL: https://issues.apache.org/jira/browse/ARROW-15139
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Weston Pace

ParquetFileWriteOptions / IpcFileWriteOptions / CsvFileWriteOptions are completely undocumented. Also, creating them via the format is a little peculiar / confusing, and I'm not sure why it is required. It should be clear to the user which options they can set and how to set them.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
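For reference, a sketch of the pattern being criticized: the options object is minted from a format instance rather than constructed directly (the zstd compression choice is just an example value):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3]})

fmt = ds.ParquetFileFormat()
# Options come from the format object; there is no obvious documented
# standalone constructor for ParquetFileWriteOptions.
opts = fmt.make_write_options(compression="zstd")
ds.write_dataset(table, "/tmp/ds_out", format=fmt, file_options=opts)
{code}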
[jira] [Created] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information
Percy Camilo Triveño Aucahuasi created ARROW-15138:
------------------------------------------------------

             Summary: [C++] Make ExecPlan::ToString give some additional information
                 Key: ARROW-15138
                 URL: https://issues.apache.org/jira/browse/ARROW-15138
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Percy Camilo Triveño Aucahuasi

We currently have ARROW-14233 (https://issues.apache.org/jira/browse/ARROW-14233); however, that task is very open-ended about its scope. This ticket addresses some improvements to ExecPlan::ToString without adding extra sophistication or extending its feature set, which means we don't need to research GraphViz (or other backends) here. We just need the following features for ExecPlan::ToString, illustrated by the sketch below:

* Print an index/ID for each node in the physical tree
* Make sure all nodes have names and correct indices/IDs
* Apply indentation when printing (this may be an optional setting in the API)

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
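As a purely hypothetical illustration of the three bullets above (not the current output format, and the node names are only examples), an indented rendering with per-node IDs might look like:

{code}
ExecPlan with 3 nodes:
  2: SinkNode{}
    1: FilterNode{filter=(x > 0)}
      0: SourceNode{}
{code}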
[jira] [Created] (ARROW-15137) [Dev] Update archery crossbow latest-prefix to work with nightly dates
Joris Van den Bossche created ARROW-15137:
---------------------------------------------

             Summary: [Dev] Update archery crossbow latest-prefix to work with nightly dates
                 Key: ARROW-15137
                 URL: https://issues.apache.org/jira/browse/ARROW-15137
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Developer Tools
            Reporter: Joris Van den Bossche
            Assignee: Joris Van den Bossche

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15136) [C++] Make S3FS tests faster
Antoine Pitrou created ARROW-15136:
--------------------------------------

             Summary: [C++] Make S3FS tests faster
                 Key: ARROW-15136
                 URL: https://issues.apache.org/jira/browse/ARROW-15136
             Project: Apache Arrow
          Issue Type: Wish
          Components: C++
            Reporter: Antoine Pitrou

One thing that would make development more pleasant would be to speed up the S3FS tests. Currently, each test waits for the launch of a new Minio child process. Instead, we could launch several of them in advance (listening on different ports and targeting different temporary directories), so that most tests don't bear the launch cost.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
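A sketch of the pre-launch idea in Python pseudocode; the actual tests are C++, and how the harness would hand out servers is left open:

{code:python}
import subprocess
import tempfile

def launch_minio_pool(count, base_port=9000):
    """Pre-launch `count` Minio servers on distinct ports and temp dirs,
    so individual tests can grab a ready server instead of spawning one."""
    procs = []
    for i in range(count):
        data_dir = tempfile.mkdtemp(prefix="minio-")
        procs.append(subprocess.Popen(
            ["minio", "server", "--address",
             f"127.0.0.1:{base_port + i}", data_dir]))
    return procs
{code}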
[jira] [Created] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables
Will Jones created ARROW-15135:
----------------------------------

             Summary: [C++][R][Python] Support reading from Apache Iceberg tables
                 Key: ARROW-15135
                 URL: https://issues.apache.org/jira/browse/ARROW-15135
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
            Reporter: Will Jones

This is an umbrella issue for supporting the Apache Iceberg table format.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15134) [GLib] Add GArrow Month, Day, Nano IntervalDataType
Keisuke Okada created ARROW-15134:
-------------------------------------

             Summary: [GLib] Add GArrow Month, Day, Nano IntervalDataType
                 Key: ARROW-15134
                 URL: https://issues.apache.org/jira/browse/ARROW-15134
             Project: Apache Arrow
          Issue Type: Sub-task
          Components: GLib
            Reporter: Keisuke Okada

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15133) [CI] Removing util_checkout.sh and util_cleanup.sh scripts
Krisztian Szucs created ARROW-15133:
---------------------------------------

             Summary: [CI] Removing util_checkout.sh and util_cleanup.sh scripts
                 Key: ARROW-15133
                 URL: https://issues.apache.org/jira/browse/ARROW-15133
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration
            Reporter: Krisztian Szucs
             Fix For: 7.0.0

- ci/scripts/util_checkout.sh was used to check out submodules because actions/checkout@v2 had removed support for that, but support has since been restored.
- ci/scripts/util_cleanup.sh was used to free up disk space on GitHub Actions runners, because at that time they were limited to 7GB. A recent run shows that the Linux runners now have 32GB of free space, so we can try disabling the cleanup step, sparing almost a minute of build time.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15132) Remove wrong use of typename in greedy_dual_size_cache.h
Projjal Chanda created ARROW-15132:
--------------------------------------

             Summary: Remove wrong use of typename in greedy_dual_size_cache.h
                 Key: ARROW-15132
                 URL: https://issues.apache.org/jira/browse/ARROW-15132
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++ - Gandiva
            Reporter: Projjal Chanda

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15131) [Python] Coerce value_set argument to array in "is_in" kernel
Joris Van den Bossche created ARROW-15131:
---------------------------------------------

             Summary: [Python] Coerce value_set argument to array in "is_in" kernel
                 Key: ARROW-15131
                 URL: https://issues.apache.org/jira/browse/ARROW-15131
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Joris Van den Bossche

Small example I ran into:

{code:python}
>>> arr = pa.array(['a', 'b', 'c', 'd'])
>>> pc.is_in(arr, ['a', 'c'])
...
TypeError: "['a', 'c']" is not a valid value set
{code}

That's not a super friendly error message (it was not directly clear what is not "valid" about this). Passing {{pa.array(['a', 'c'])}} explicitly works, but I expected that the kernel would try this conversion automatically (as we also convert the first array argument to an array).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
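For comparison, the explicit form that does work today, continuing the session above:

{code:python}
>>> pc.is_in(arr, value_set=pa.array(['a', 'c']))
<pyarrow.lib.BooleanArray object at ...>
[
  true,
  false,
  true,
  false
]
{code}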
[jira] [Created] (ARROW-15130) [Doc] Annotation Glossary
Alenka Frim created ARROW-15130:
-----------------------------------

             Summary: [Doc] Annotation Glossary
                 Key: ARROW-15130
                 URL: https://issues.apache.org/jira/browse/ARROW-15130
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Documentation
            Reporter: Alenka Frim

Add an Annotation Glossary for the Apache Arrow project.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15129) [Doc][C++] Building and Memory allocators
Alenka Frim created ARROW-15129:
-----------------------------------

             Summary: [Doc][C++] Building and Memory allocators
                 Key: ARROW-15129
                 URL: https://issues.apache.org/jira/browse/ARROW-15129
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Documentation
            Reporter: Alenka Frim

Add a section about memory allocation options in the build process of the C++ library. The content could fit under the [Build Dependency Management|https://arrow.apache.org/docs/developers/cpp/building.html#build-dependency-management] section.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
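For context, a sketch of the kind of build flags such a section would presumably cover; ARROW_JEMALLOC and ARROW_MIMALLOC are existing CMake options, though the exact scope of the new docs is up for discussion:

{code}
cmake .. -DARROW_JEMALLOC=ON -DARROW_MIMALLOC=OFF
{code}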
[jira] [Created] (ARROW-15128) [C++] segfault when writing CSV from RecordBatchReader
Nicola Crane created ARROW-15128:
------------------------------------

             Summary: [C++] segfault when writing CSV from RecordBatchReader
                 Key: ARROW-15128
                 URL: https://issues.apache.org/jira/browse/ARROW-15128
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Nicola Crane

I'm currently trying to implement functionality in R so that we can open a dataset and then write it to a CSV file, but I'm getting a segfault when I run my tests:

{code:r}
tbl <- tibble::tibble(
  dbl = c(1:8, NA, 10) + .1,
  lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
  false = logical(10),
  chr = letters[c(1:5, NA, 7:10)]
)

make_temp_dir <- function() {
  path <- tempfile()
  dir.create(path)
  normalizePath(path, winslash = "/")
}

data_dir <- make_temp_dir()
write_dataset(tbl, data_dir, partitioning = "lgl")
data_in <- open_dataset(data_dir)

csv_file <- tempfile()
tbl_out <- write_csv_arrow(data_in, csv_file)
{code}

{code}
Thread 1 "R" received signal SIGSEGV, Segmentation fault.
0x7fffee51fdd7 in __gnu_cxx::__exchange_and_add (__mem=0xe9, __val=-1) at /usr/include/c++/9/ext/atomicity.h:49
49	  { return __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)