[jira] [Created] (ARROW-18095) [CI][C++][MinGW] All tests exited with 0xc0000139
Kouhei Sutou created ARROW-18095:
------------------------------------

             Summary: [CI][C++][MinGW] All tests exited with 0xc0000139
                 Key: ARROW-18095
                 URL: https://issues.apache.org/jira/browse/ARROW-18095
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Continuous Integration
            Reporter: Kouhei Sutou
            Assignee: Kouhei Sutou


https://github.com/apache/arrow/actions/runs/3261682270/jobs/5357126875

{noformat}
+ ctest --label-regex unittest --output-on-failure --parallel 2 --timeout 300 --exclude-regex 'gandiva-internals-test|gandiva-projector-test|gandiva-utf8-test|gandiva-binary-test|gandiva-boolean-expr-test|gandiva-date-time-test|gandiva-decimal-single-test|gandiva-decimal-test|gandiva-filter-project-test|gandiva-filter-test|gandiva-hash-test|gandiva-if-expr-test|gandiva-in-expr-test|gandiva-literal-test|gandiva-null-validity-test|gandiva-precompiled-test|gandiva-projector-test'
Test project D:/a/arrow/arrow/build/cpp
      Start  1: arrow-array-test
      Start  2: arrow-buffer-test
 1/67 Test  #2: arrow-buffer-test ................Exit code 0xc0000139
***Exception:   0.15 sec
      Start  3: arrow-extension-type-test
 2/67 Test  #1: arrow-array-test .................Exit code 0xc0000139
***Exception:   0.17 sec
      Start  4: arrow-misc-test
 3/67 Test  #3: arrow-extension-type-test ........Exit code 0xc0000139
***Exception:   0.04 sec
     39 - arrow-dataset-discovery-test (Exit code 0xc0000139)
     40 - arrow-dataset-file-ipc-test (Exit code 0xc0000139)
     41 - arrow-dataset-file-test (Exit code 0xc0000139)
     42 - arrow-dataset-partition-test (Exit code 0xc0000139)
     43 - arrow-dataset-scanner-test (Exit code 0xc0000139)
     44 - arrow-dataset-file-csv-test (Exit code 0xc0000139)
     45 - arrow-dataset-file-parquet-test (Exit code 0xc0000139)
     46 - arrow-filesystem-test (Exit code 0xc0000139)
Errors while running CTest
     47 - arrow-gcsfs-test (Exit code 0xc0000139)
     48 - arrow-s3fs-test (Exit code 0xc0000139)
     49 - arrow-flight-internals-test (Exit code 0xc0000139)
     50 - arrow-flight-test (Exit code 0xc0000139)
     51 - arrow-flight-sql-test (Exit code 0xc0000139)
     52 - arrow-feather-test (Exit code 0xc0000139)
     53 - arrow-ipc-json-simple-test (Exit code 0xc0000139)
     54 - arrow-ipc-read-write-test (Exit code 0xc0000139)
     55 - arrow-ipc-tensor-test (Exit code 0xc0000139)
     56 - arrow-json-test (Exit code 0xc0000139)
     57 - parquet-internals-test (Exit code 0xc0000139)
     58 - parquet-reader-test (Exit code 0xc0000139)
     59 - parquet-writer-test (Exit code 0xc0000139)
     60 - parquet-arrow-test (Exit code 0xc0000139)
     61 - parquet-arrow-internals-test (Exit code 0xc0000139)
     62 - parquet-encryption-test (Exit code 0xc0000139)
     63 - parquet-encryption-key-management-test (Exit code 0xc0000139)
     64 - parquet-file-deserialize-test (Exit code 0xc0000139)
     65 - parquet-schema-test (Exit code 0xc0000139)
     66 - gandiva-projector-build-validation-test (Exit code 0xc0000139)
     67 - gandiva-to-string-test (Exit code 0xc0000139)
Error: Process completed with exit code 8.
{noformat}

The last successful job: https://github.com/apache/arrow/actions/runs/3256683017/jobs/5347422431
[jira] [Created] (ARROW-18094) [Dev][CI] Make nightly group as an alias of nightly-*
Kouhei Sutou created ARROW-18094:
------------------------------------

             Summary: [Dev][CI] Make nightly group as an alias of nightly-*
                 Key: ARROW-18094
                 URL: https://issues.apache.org/jira/browse/ARROW-18094
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration, Developer Tools
            Reporter: Kouhei Sutou
            Assignee: Kouhei Sutou


We use the {{nightly-*}} groups, not a {{nightly}} group, for our nightly CI. So we need to run {{crossbow submit -g nightly-tests -g nightly-packaging -g nightly-release}} when we want to run nightly jobs before we merge a pull request. But this is inconvenient and easy to get wrong. For example, some developers use {{crossbow submit -g nightly}} to run nightly jobs. How about making the {{nightly}} group an alias of the {{nightly-*}} groups to improve the developer experience?
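As an illustration of the proposal, a minimal, hypothetical sketch (not the actual crossbow code; the group names and expansion point are assumptions) of expanding a {{nightly}} alias into the {{nightly-*}} groups before submission:

{code:python}
import fnmatch

# Hypothetical group registry; crossbow's real groups live in tasks.yml.
groups = ["nightly-tests", "nightly-packaging", "nightly-release", "wheel"]

def expand_group(requested, groups):
    # Treat "nightly" as an alias rather than a real group.
    if requested == "nightly":
        return fnmatch.filter(groups, "nightly-*")
    return [requested]

print(expand_group("nightly", groups))
# ['nightly-tests', 'nightly-packaging', 'nightly-release']
{code}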
[jira] [Created] (ARROW-18093) [CI][Conda][Windows] Failed with missing ORC
Kouhei Sutou created ARROW-18093:
------------------------------------

             Summary: [CI][Conda][Windows] Failed with missing ORC
                 Key: ARROW-18093
                 URL: https://issues.apache.org/jira/browse/ARROW-18093
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration, Packaging
            Reporter: Kouhei Sutou
            Assignee: Kouhei Sutou


https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=37759&view=logs&j=4c86bc1b-1091-5192-4404-c74dfaad23e7&t=41795ef0-6501-5db4-3ad4-33c0cf085626&l=497

{noformat}
CMake Error at cmake_modules/FindORC.cmake:56 (message):
  ORC library was required in toolchain and unable to locate
Call Stack (most recent call first):
  cmake_modules/ThirdpartyToolchain.cmake:280 (find_package)
  cmake_modules/ThirdpartyToolchain.cmake:4362 (resolve_dependency)
  CMakeLists.txt:496 (include)
{noformat}
[jira] [Created] (ARROW-18092) [CI][Conan] Failed with gRPC related dependency resolution failure
Kouhei Sutou created ARROW-18092:
------------------------------------

             Summary: [CI][Conan] Failed with gRPC related dependency resolution failure
                 Key: ARROW-18092
                 URL: https://issues.apache.org/jira/browse/ARROW-18092
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration, Packaging
            Reporter: Kouhei Sutou
            Assignee: Kouhei Sutou


https://github.com/ursacomputing/crossbow/actions/runs/3271941831/jobs/5382341820#step:5:566

{noformat}
WARN: Remotes registry file missing, creating default one in /root/.conan/remotes.json
WARN: grpc/1.48.0: requirement re2/20220601 overridden by arrow/10.0.0 to re2/20220201
WARN: grpc/1.48.0: requirement protobuf/3.21.4 overridden by arrow/10.0.0 to protobuf/3.21.1
WARN: googleapis/cci.20220711: requirement protobuf/3.21.4 overridden by grpc/1.48.0 to protobuf/3.21.1
WARN: grpc-proto/cci.20220627: requirement protobuf/3.21.4 overridden by grpc/1.48.0 to protobuf/3.21.1
ERROR: Missing binary: grpc/1.48.0:ddc600b3316e16c4e38f2c1ca1214d7241b4dd80

grpc/1.48.0: WARN: Can't find a 'grpc/1.48.0' package for the specified settings, options and dependencies:
- Settings: arch=x86_64, build_type=Release, compiler=gcc, compiler.libcxx=libstdc++, compiler.version=10, os=Linux
- Options: codegen=True, cpp_plugin=True, csharp_ext=False, csharp_plugin=True, fPIC=True, node_plugin=True, objective_c_plugin=True, php_plugin=True, python_plugin=True, ruby_plugin=True, secure=False, shared=False, abseil:fPIC=True, abseil:shared=False, c-ares:fPIC=True, c-ares:shared=False, c-ares:tools=True, googleapis:fPIC=True, googleapis:shared=False, grpc-proto:fPIC=True, grpc-proto:shared=False, openssl:386=False, openssl:enable_weak_ssl_ciphers=False, openssl:fPIC=True, openssl:no_aria=False, openssl:no_asm=False, openssl:no_async=False, openssl:no_bf=False, openssl:no_blake2=False, openssl:no_camellia=False, openssl:no_cast=False, openssl:no_chacha=False, openssl:no_cms=False, openssl:no_comp=False, openssl:no_ct=False, openssl:no_deprecated=False, openssl:no_des=False, openssl:no_dgram=False, openssl:no_dh=False, openssl:no_dsa=False, openssl:no_dso=False, openssl:no_ec=False, openssl:no_ecdh=False, openssl:no_ecdsa=False, openssl:no_engine=False, openssl:no_filenames=False, openssl:no_gost=False, openssl:no_hmac=False, openssl:no_idea=False, openssl:no_md4=False, openssl:no_md5=False, openssl:no_mdc2=False, openssl:no_ocsp=False, openssl:no_pinshared=False, openssl:no_rc2=False, openssl:no_rfc3779=False, openssl:no_rmd160=False, openssl:no_rsa=False, openssl:no_seed=False, openssl:no_sha=False, openssl:no_sm2=False, openssl:no_sm3=False, openssl:no_sm4=False, openssl:no_sock=False, openssl:no_srp=False, openssl:no_srtp=False, openssl:no_sse2=False, openssl:no_ssl=False, openssl:no_ssl3=False, openssl:no_stdio=False, openssl:no_tests=False, openssl:no_threads=False, openssl:no_tls1=False, openssl:no_ts=False, openssl:no_whirlpool=False, openssl:openssldir=None, openssl:shared=False, protobuf:debug_suffix=True, protobuf:fPIC=True, protobuf:lite=False, protobuf:shared=False, protobuf:with_rtti=True, protobuf:with_zlib=True, re2:fPIC=True, re2:shared=False, zlib:fPIC=True, zlib:shared=False
- Dependencies: abseil/20220623.0, c-ares/1.18.1, openssl/1.1.1q, re2/20220201, zlib/1.2.12, protobuf/3.21.1, googleapis/cci.20220711, grpc-proto/cci.20220627
- Requirements: abseil/20220623.Y.Z, c-ares/1.Y.Z, googleapis/cci.20220711, grpc-proto/cci.20220627, openssl/1.Y.Z, protobuf/3.21.1:37dd8aae630726607d9d4108fefd2f59c8f7e9db, re2/20220201.Y.Z, zlib/1.Y.Z
- Package ID: ddc600b3316e16c4e38f2c1ca1214d7241b4dd80

ERROR: Missing prebuilt package for 'grpc/1.48.0'
{noformat}
[jira] [Created] (ARROW-18091) [Ruby] Arrow::Table#join returns separated columns by key
Hirokazu SUZUKI created ARROW-18091:
------------------------------------

             Summary: [Ruby] Arrow::Table#join returns separated columns by key
                 Key: ARROW-18091
                 URL: https://issues.apache.org/jira/browse/ARROW-18091
             Project: Apache Arrow
          Issue Type: Bug
          Components: Ruby
            Reporter: Hirokazu SUZUKI
[jira] [Created] (ARROW-18090) Dictionary Style array for Keywords or Tags
Sven Cattell created ARROW-18090:
------------------------------------

             Summary: Dictionary Style array for Keywords or Tags
                 Key: ARROW-18090
                 URL: https://issues.apache.org/jira/browse/ARROW-18090
             Project: Apache Arrow
          Issue Type: New Feature
            Reporter: Sven Cattell


I want to efficiently encode lists of tags for each element in my database. In my case I have 30 tags, and a few are assigned to each of my ~20m records. Here's a simplified example of 5 records:

* pe, keylogger, cryptojack
* pe, packed
* pe, cryptojack, c2
* pe, keylogger, c2
* pe

Right now I have to store these in a List<String> and have huge amounts of duplicate data. The dictionary array looks almost perfect for this task. I just want to allow a List<T> instead of just T for the allowed primitive index type in a dictionary.
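For reference, a workaround that appears possible with pyarrow today (a sketch, not the requested feature itself): dictionary-encode the flattened values of the list column, then rebuild the lists over the encoded values, so each tag string is stored once and the lists hold small indices:

{code:python}
import pyarrow as pa

tags = pa.array(
    [["pe", "keylogger", "cryptojack"],
     ["pe", "packed"],
     ["pe", "cryptojack", "c2"],
     ["pe", "keylogger", "c2"],
     ["pe"]],
    type=pa.list_(pa.string()),
)
# Dictionary-encode the flat values, then rebuild the list structure on top.
encoded = pa.ListArray.from_arrays(tags.offsets, tags.values.dictionary_encode())
print(encoded.type)  # list<item: dictionary<values=string, indices=int32>>
{code}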
[jira] [Created] (ARROW-18089) [R] Cannot read_parquet on http URL
Neal Richardson created ARROW-18089:
------------------------------------

             Summary: [R] Cannot read_parquet on http URL
                 Key: ARROW-18089
                 URL: https://issues.apache.org/jira/browse/ARROW-18089
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Neal Richardson
             Fix For: 11.0.0


{code}
u <- "https://raw.githubusercontent.com/apache/arrow/master/r/inst/v0.7.1.parquet"
read_parquet(u)
# Error: file must be a "RandomAccessFile"
read_parquet(url(u))
# Error: file must be a "RandomAccessFile"
{code}

The issue is that URLs get turned into an InputStream by {{make_readable_file}}, and Parquet requires a RandomAccessFile.

{code}
arrow:::make_readable_file(u)
# InputStream
{code}

There are two relevant code paths in make_readable_file: if given a string URL, it tries {{FileSystem$from_uri()}} and falls back to {{MakeRConnectionInputStream}}, which returns an InputStream, not a RandomAccessFile. If given a connection object (i.e. {{url(u)}}), it tries MakeRConnectionRandomAccessFile first and falls back to MakeRConnectionInputStream. If you provide a {{url()}}, it does fall back to InputStream:

{code}
arrow:::MakeRConnectionRandomAccessFile(url(u))
# Error: Tell() returned an error
{code}

If we truly can't work with an HTTP URL in read_parquet, we should at least document that. We could also do the workaround of downloading to a tempfile first.
[jira] [Created] (ARROW-18088) [Python][CI] Build with pandas master/nightly failure related to timedelta64 resolution
Joris Van den Bossche created ARROW-18088:
------------------------------------------

             Summary: [Python][CI] Build with pandas master/nightly failure related to timedelta64 resolution
                 Key: ARROW-18088
                 URL: https://issues.apache.org/jira/browse/ARROW-18088
             Project: Apache Arrow
          Issue Type: Test
          Components: Python
            Reporter: Joris Van den Bossche
            Assignee: Joris Van den Bossche


The nightly Python builds using the pandas development version are failing: https://github.com/ursacomputing/crossbow/actions/runs/3269767207/jobs/5377649455

Example failure:

{code}
_________________ test_parquet_2_0_roundtrip[None-True] _________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_parquet_2_0_roundtrip_Non0')
chunk_size = None, use_legacy_dataset = True

    @pytest.mark.pandas
    @parametrize_legacy_dataset
    @pytest.mark.parametrize('chunk_size', [None, 1000])
    def test_parquet_2_0_roundtrip(tempdir, chunk_size, use_legacy_dataset):
        df = alltypes_sample(size=1, categorical=True)

        filename = tempdir / 'pandas_roundtrip.parquet'
        arrow_table = pa.Table.from_pandas(df)
        assert arrow_table.schema.pandas_metadata is not None

        _write_table(arrow_table, filename, version='2.6',
                     coerce_timestamps='ms', chunk_size=chunk_size)
        table_read = pq.read_pandas(
            filename, use_legacy_dataset=use_legacy_dataset)
        assert table_read.schema.pandas_metadata is not None

        read_metadata = table_read.schema.metadata
        assert arrow_table.schema.metadata == read_metadata

        df_read = table_read.to_pandas()
>       tm.assert_frame_equal(df, df_read)
E       AssertionError: Attributes of DataFrame.iloc[:, 12] (column name="timedelta") are different
E
E       Attribute "dtype" are different
E       [left]:  timedelta64[s]
E       [right]: timedelta64[ns]

/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/tests/parquet/test_data_types.py:76: AssertionError
{code}
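A plausible trigger (an assumption based on the dtype mismatch above, not stated in the report) is that the pandas development version now preserves non-nanosecond resolutions where released versions coerce to nanoseconds:

{code:python}
import numpy as np
import pandas as pd

# A second-resolution timedelta array, as in alltypes_sample's "timedelta" column.
td = np.arange(3).astype("timedelta64[s]")
print(pd.Series(td).dtype)
# pandas 1.x:       timedelta64[ns] (coerced on construction)
# pandas dev (2.0): timedelta64[s]  (resolution preserved)
{code}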
[jira] [Created] (ARROW-18087) [C++] RecordBatch::Equals ignores field names
Joris Van den Bossche created ARROW-18087:
------------------------------------------

             Summary: [C++] RecordBatch::Equals ignores field names
                 Key: ARROW-18087
                 URL: https://issues.apache.org/jira/browse/ARROW-18087
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche


The {{RecordBatch::Equals}} method only checks the equality of the schemas of both batches if {{check_metadata=True}}, with the result that it doesn't actually check the schema (e.g. field names) by default.

Python illustration:

{code}
In [3]: batch1 = pa.record_batch(pd.DataFrame({'a': [1, 2, 3]}))

In [4]: batch2 = pa.record_batch(pd.DataFrame({'b': [1, 2, 3]}))

In [5]: batch1.equals(batch2)
Out[5]: True

In [6]: batch1.equals(batch2, check_metadata=True)
Out[6]: False
{code}

My expectation is that RecordBatch equality always requires equal field names (as Table::Equals does), and that the {{check_metadata}} keyword should only control whether the metadata of the schema is considered (as the documentation also says), not whether the schema is checked at all.
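Until the semantics change, callers can get strict behavior by comparing the schemas explicitly; a minimal sketch:

{code:python}
import pyarrow as pa

batch1 = pa.record_batch([pa.array([1, 2, 3])], names=["a"])
batch2 = pa.record_batch([pa.array([1, 2, 3])], names=["b"])

# Compare the schema (field names and types) in addition to the data.
strictly_equal = batch1.schema.equals(batch2.schema) and batch1.equals(batch2)
print(strictly_equal)  # False: the field names differ
{code}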
[GitHub] [arrow-julia] palday opened a new issue, #345: Tests fail on Apple silicon on Julia 1.8
palday opened a new issue, #345:
URL: https://github.com/apache/arrow-julia/issues/345

```julia
ArgumentError: unsafe_wrap: pointer 0x14858d048 is not properly aligned to 16 bytes
Stacktrace:
  [1] #unsafe_wrap#102
    @ ./pointer.jl:89 [inlined]
  [2] unsafe_wrap
    @ ./pointer.jl:87 [inlined]
  [3] reinterp(#unused#::Type{Arrow.Decimal{2, 2, Int128}}, batch::Arrow.Batch, buf::Arrow.Flatbuf.Buffer, compression::Nothing)
    @ Arrow ~/Code/arrow-julia/src/table.jl:507
  [4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Decimal, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
```

Full test output

```julia
(Arrow) pkg> test
     Testing Arrow
      Status `/private/var/folders/yy/nyj87tsn7093bb7d84rl64rhgp/T/jl_xRGYNK/Project.toml`
  [69666777] Arrow v2.3.0 `~/Code/arrow-julia`
⌅ [31f734f8] ArrowTypes v1.2.1
  [c3b6d118] BitIntegers v0.2.6
  [324d7699] CategoricalArrays v0.10.7
  [5ba52731] CodecLz4 v0.4.0
  [6b39b394] CodecZstd v0.7.2
  [9a962f9c] DataAPI v1.12.0
  [48062228] FilePathsBase v0.9.20
  [0f8b85d8] JSON3 v1.10.0
  [2dfb63ee] PooledArrays v1.4.2
  [91c51154] SentinelArrays v1.3.16
  [856f2bd8] StructTypes v1.10.0
  [bd369af6] Tables v1.10.0
  [f269a46b] TimeZones v1.9.0
  [76eceee3] WorkerUtilities v1.1.0
  [ade2ca70] Dates `@stdlib/Dates`
  [a63ad114] Mmap `@stdlib/Mmap`
  [9a3f8284] Random `@stdlib/Random`
  [8dfed614] Test `@stdlib/Test`
  [cf7118a7] UUIDs `@stdlib/UUIDs`
      Status `/private/var/folders/yy/nyj87tsn7093bb7d84rl64rhgp/T/jl_xRGYNK/Manifest.toml`
  [69666777] Arrow v2.3.0 `~/Code/arrow-julia`
⌅ [31f734f8] ArrowTypes v1.2.1
  [c3b6d118] BitIntegers v0.2.6
  [fa961155] CEnum v0.4.2
  [324d7699] CategoricalArrays v0.10.7
  [5ba52731] CodecLz4 v0.4.0
  [6b39b394] CodecZstd v0.7.2
⌅ [34da2185] Compat v3.46.0
  [9a962f9c] DataAPI v1.12.0
  [e2d170a0] DataValueInterfaces v1.0.0
  [e2ba6199] ExprTools v0.1.8
  [48062228] FilePathsBase v0.9.20
  [842dd82b] InlineStrings v1.2.2
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.4.1
  [0f8b85d8] JSON3 v1.10.0
  [e1d29d7a] Missings v1.0.2
  [78c3b35d] Mocking v0.7.3
  [bac558e1] OrderedCollections v1.4.1
  [69de0a69] Parsers v2.4.2
  [2dfb63ee] PooledArrays v1.4.2
  [21216c6a] Preferences v1.3.0
  [3cdcf5f2] RecipesBase v1.3.1
  [ae029012] Requires v1.3.0
  [6c6a2e73] Scratch v1.1.1
  [91c51154] SentinelArrays v1.3.16
  [66db9d55] SnoopPrecompile v1.0.1
  [856f2bd8] StructTypes v1.10.0
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.10.0
  [f269a46b] TimeZones v1.9.0
  [3bb67fe8] TranscodingStreams v0.9.9
  [76eceee3] WorkerUtilities v1.1.0
  [5ced341a] Lz4_jll v1.9.3+0
  [3161d3a3] Zstd_jll v1.5.2+0
  [0dad84c5] ArgTools v1.1.1 `@stdlib/ArgTools`
  [56f22d72] Artifacts `@stdlib/Artifacts`
  [2a0f44e3] Base64 `@stdlib/Base64`
  [ade2ca70] Dates `@stdlib/Dates`
  [8bb1440f] DelimitedFiles `@stdlib/DelimitedFiles`
  [8ba89e20] Distributed `@stdlib/Distributed`
  [f43a241f] Downloads v1.6.0 `@stdlib/Downloads`
  [7b1f6079] FileWatching `@stdlib/FileWatching`
  [9fa8497b] Future `@stdlib/Future`
  [b77e0a4c] InteractiveUtils `@stdlib/InteractiveUtils`
  [4af54fe1] LazyArtifacts `@stdlib/LazyArtifacts`
  [b27032c2] LibCURL v0.6.3 `@stdlib/LibCURL`
  [76f85450] LibGit2 `@stdlib/LibGit2`
  [8f399da3] Libdl `@stdlib/Libdl`
  [37e2e46d] LinearAlgebra `@stdlib/LinearAlgebra`
  [56ddb016] Logging `@stdlib/Logging`
  [d6f4376e] Markdown `@stdlib/Markdown`
  [a63ad114] Mmap `@stdlib/Mmap`
  [ca575930] NetworkOptions v1.2.0 `@stdlib/NetworkOptions`
  [44cfe95a] Pkg v1.8.0 `@stdlib/Pkg`
  [de0858da] Printf `@stdlib/Printf`
  [3fa0cd96] REPL `@stdlib/REPL`
  [9a3f8284] Random `@stdlib/Random`
  [ea8e919c] SHA v0.7.0 `@stdlib/SHA`
  [9e88b42a] Serialization `@stdlib/Serialization`
  [1a1011a3] SharedArrays `@stdlib/SharedArrays`
  [6462fe0b] Sockets `@stdlib/Sockets`
  [2f01184e] SparseArrays `@stdlib/SparseArrays`
  [10745b16] Statistics `@stdlib/Statistics`
  [fa267f1f] TOML v1.0.0 `@stdlib/TOML`
  [a4e569a6] Tar v1.10.1 `@stdlib/Tar`
  [8dfed614] Test `@stdlib/Test`
  [cf7118a7] UUIDs `@stdlib/UUIDs`
  [4ec0a83e] Unicode `@stdlib/Unicode`
  [e66e0078] CompilerSupportLibraries_jll v0.5.2+0 `@stdlib/CompilerSupportLibraries_jll`
  [deac9b47] LibCURL_jll v7.84.0+0 `@stdlib/LibCURL_jll`
  [29816b5a] LibSSH2_jll v1.10.2+0 `@stdlib/LibSSH2_jll`
  [c8ffd9c3] MbedTLS_jll v2.28.0+0 `@stdlib/MbedTLS_jll`
  [14a3606d] MozillaCACerts_jll v2022.2.1 `@stdlib/MozillaCACerts_jll`
  [4536629a] OpenBLAS_jll v0.3.20+0 `@stdlib/O
```
[jira] [Created] (ARROW-18086) In Red Arrow, importing table containing float16 array throws error
Atte Keinänen created ARROW-18086:
----------------------------------

             Summary: In Red Arrow, importing table containing float16 array throws error
                 Key: ARROW-18086
                 URL: https://issues.apache.org/jira/browse/ARROW-18086
             Project: Apache Arrow
          Issue Type: Bug
          Components: Ruby
    Affects Versions: 9.0.0
            Reporter: Atte Keinänen
            Assignee: Kouhei Sutou


In Red Arrow, loading a table that contains a float16 array leads to this error when using the IPC streaming format:

{code:java}
> Arrow::Table.load(Arrow::Buffer.new(resp.body), format: :arrow_streaming)
cannot create instance of abstract (non-instantiatable) type 'GArrowDataType'
from /usr/local/bundle/gems/gobject-introspection-4.0.3/lib/gobject-introspection/loader.rb:688:in `invoke'
from /usr/local/bundle/gems/gobject-introspection-4.0.3/lib/gobject-introspection/loader.rb:559:in `get_field'
{code}
[jira] [Created] (ARROW-18085) [Dev][Archery][Crossbow] Comment report bot uses the wrong URL if task run has not started
Raúl Cumplido created ARROW-18085:
----------------------------------

             Summary: [Dev][Archery][Crossbow] Comment report bot uses the wrong URL if task run has not started
                 Key: ARROW-18085
                 URL: https://issues.apache.org/jira/browse/ARROW-18085
             Project: Apache Arrow
          Issue Type: Bug
          Components: Archery, Continuous Integration
            Reporter: Raúl Cumplido
             Fix For: 11.0.0


As discussed in this comment: https://github.com/apache/arrow/pull/14446#issuecomment-1282067185

Sometimes the task URL that we use in the report is not correct because the job run has not yet started on GitHub, which forces us to wait and, if the run is not found, to fall back to the branch URL. In those cases we should use the URL we used before ARROW-18028 (https://issues.apache.org/jira/browse/ARROW-18028) was merged: https://github.com/apache/arrow/commit/1e481b5d6dc6537e1994a4ff03334e95c7cfca93

In the case of GitHub:

{code:java}
https://github.com/{repo}/actions?query=branch:{branch}
{code}
[jira] [Created] (ARROW-18084) "CSV parser got out of sync with chunker" on subsequent batches regardless of block size
Juan Luis Cano Rodríguez created ARROW-18084:
---------------------------------------------

             Summary: "CSV parser got out of sync with chunker" on subsequent batches regardless of block size
                 Key: ARROW-18084
                 URL: https://issues.apache.org/jira/browse/ARROW-18084
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 9.0.0, 7.0.0
         Environment: Ubuntu Linux
pyarrow 9.0.0 installed with pip (manylinux wheel)
Python 3.9.0 from conda-forge
GCC 9.4.0
            Reporter: Juan Luis Cano Rodríguez
         Attachments: Screenshot 2022-10-18 at 10-11-29 JupyterLab · Orchest.png

I'm trying to read a specific large CSV file (`the-reddit-climate-change-dataset-comments.csv` from [this dataset|https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset]) by batches. This is my code:

{code:python}
import os

import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions
import pyarrow.parquet as pq

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"
print(f"Reading {filename}...")

mmap = pa.memory_map(filename)
reader = open_csv(mmap)
while True:
    try:
        batch = reader.read_next_batch()
        print(len(batch))
    except StopIteration:
        break
{code}

But, after a few batches, I get an exception:

{noformat}
Reading /data/reddit-climate/the-reddit-climate-change-dataset-comments.csv...
1233
1279
1293
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Input In [1], in <cell line: 13>()
     13 while True:
     14     try:
---> 15         batch = reader.read_next_batch()
     16         print(len(batch))
     17     except StopIteration:

File /opt/conda/lib/python3.9/site-packages/pyarrow/ipc.pxi:683, in pyarrow.lib.RecordBatchReader.read_next_batch()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: CSV parser got out of sync with chunker
{noformat}

I have tried changing the block size, but I always end up with that error sooner or later:

- With {{read_options=ReadOptions(block_size=10_000)}}, it reads 1 batch of 11 rows and then crashes
- With 100_000, 103 rows and then crashes
- 1_000_000: 1164 rows and then crashes
- 10_000_000: 12370 rows and then crashes

I am not sure what else to try here. According to [the C++ source code|https://github.com/apache/arrow/blob/cd33544533ee7d70cd8ff7556e59ef8f1d33a176/cpp/src/arrow/csv/reader.cc#L266-L267], this "should not happen". I have tried with pyarrow 7.0 and 9.0 with an identical result and traceback.
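If the file fits in memory, one possible workaround (a sketch; it assumes the eager, non-incremental reader is unaffected by the streaming chunker issue) is to read the whole file with {{read_csv}} and slice it into batches afterwards:

{code:python}
import pyarrow as pa
from pyarrow import csv

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"
# Eager read: parses the whole file at once, bypassing the streaming chunker.
table = csv.read_csv(filename)
for batch in table.to_batches(max_chunksize=64_000):
    print(batch.num_rows)
{code}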
[jira] [Created] (ARROW-18083) [C++] Bump vendored zlib
Antoine Pitrou created ARROW-18083:
------------------------------------

             Summary: [C++] Bump vendored zlib
                 Key: ARROW-18083
                 URL: https://issues.apache.org/jira/browse/ARROW-18083
             Project: Apache Arrow
          Issue Type: Task
          Components: C++
            Reporter: Antoine Pitrou
             Fix For: 10.0.0


zlib recently released version 1.2.13, which includes a security fix. We should bump the vendored version before 10.0.0.