[jira] [Created] (ARROW-18095) [CI][C++][MinGW] All tests exited with 0xc0000139

2022-10-18 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18095:


 Summary: [CI][C++][MinGW] All tests exited with 0xc0000139
 Key: ARROW-18095
 URL: https://issues.apache.org/jira/browse/ARROW-18095
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


https://github.com/apache/arrow/actions/runs/3261682270/jobs/5357126875

{noformat}
+ ctest --label-regex unittest --output-on-failure --parallel 2 --timeout 300 
--exclude-regex 
'gandiva-internals-test|gandiva-projector-test|gandiva-utf8-test|gandiva-binary-test|gandiva-boolean-expr-test|gandiva-date-time-test|gandiva-decimal-single-test|gandiva-decimal-test|gandiva-filter-project-test|gandiva-filter-test|gandiva-hash-test|gandiva-if-expr-test|gandiva-in-expr-test|gandiva-literal-test|gandiva-null-validity-test|gandiva-precompiled-test|gandiva-projector-test'
Test project D:/a/arrow/arrow/build/cpp
  Start  1: arrow-array-test
  Start  2: arrow-buffer-test
 1/67 Test  #2: arrow-buffer-test .Exit code 0xc0000139
***Exception:   0.15 sec

  Start  3: arrow-extension-type-test
 2/67 Test  #1: arrow-array-test ..Exit code 0xc0000139
***Exception:   0.17 sec

  Start  4: arrow-misc-test
 3/67 Test  #3: arrow-extension-type-test .Exit code 0xc0000139
***Exception:   0.04 sec
 39 - arrow-dataset-discovery-test (Exit code 0xc0000139
)
 40 - arrow-dataset-file-ipc-test (Exit code 0xc0000139
)
 41 - arrow-dataset-file-test (Exit code 0xc0000139
)
 42 - arrow-dataset-partition-test (Exit code 0xc0000139
)
 43 - arrow-dataset-scanner-test (Exit code 0xc0000139
)
 44 - arrow-dataset-file-csv-test (Exit code 0xc0000139
)
 45 - arrow-dataset-file-parquet-test (Exit code 0xc0000139
)
 46 - arrow-filesystem-test (Exit code 0xc0000139
)
Errors while running CTest
 47 - arrow-gcsfs-test (Exit code 0xc0000139
)
 48 - arrow-s3fs-test (Exit code 0xc0000139
)
 49 - arrow-flight-internals-test (Exit code 0xc0000139
)
 50 - arrow-flight-test (Exit code 0xc0000139
)
 51 - arrow-flight-sql-test (Exit code 0xc0000139
)
 52 - arrow-feather-test (Exit code 0xc0000139
)
 53 - arrow-ipc-json-simple-test (Exit code 0xc0000139
)
 54 - arrow-ipc-read-write-test (Exit code 0xc0000139
)
 55 - arrow-ipc-tensor-test (Exit code 0xc0000139
)
 56 - arrow-json-test (Exit code 0xc0000139
)
 57 - parquet-internals-test (Exit code 0xc0000139
)
 58 - parquet-reader-test (Exit code 0xc0000139
)
 59 - parquet-writer-test (Exit code 0xc0000139
)
 60 - parquet-arrow-test (Exit code 0xc0000139
)
 61 - parquet-arrow-internals-test (Exit code 0xc0000139
)
 62 - parquet-encryption-test (Exit code 0xc0000139
)
 63 - parquet-encryption-key-management-test (Exit code 0xc0000139
)
 64 - parquet-file-deserialize-test (Exit code 0xc0000139
)
 65 - parquet-schema-test (Exit code 0xc0000139
)
 66 - gandiva-projector-build-validation-test (Exit code 0xc0000139
)
 67 - gandiva-to-string-test (Exit code 0xc0000139
)
Error: Process completed with exit code 8.
{noformat}

Exit code 0xc0000139 is STATUS_ENTRYPOINT_NOT_FOUND, a Windows loader error: an entry point could not be located in a DLL. That typically points to a missing or mismatched DLL rather than an ordinary test failure.

The last successful job:
https://github.com/apache/arrow/actions/runs/3256683017/jobs/5347422431



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18094) [Dev][CI] Make nightly group as an alias of nightly-*

2022-10-18 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18094:


 Summary: [Dev][CI] Make nightly group as an alias of nightly-*
 Key: ARROW-18094
 URL: https://issues.apache.org/jira/browse/ARROW-18094
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


We use {{nightly-*}} groups, not a {{nightly}} group, for our nightly CI.
So, to run the nightly jobs before merging a pull request, we need to use
{{crossbow submit -g nightly-tests -g nightly-packaging -g nightly-release}}.
But this is inconvenient and error-prone. For example, some developers run
{{crossbow submit -g nightly}} expecting it to trigger the nightly jobs.

How about making the {{nightly}} group an alias of the {{nightly-*}} groups to
improve the developer experience?
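
For illustration, a minimal sketch of the proposed alias (the {{GROUP_ALIASES}} table and the {{expand_groups}} helper are hypothetical names, not the actual crossbow code):

{code:python}
# Hypothetical sketch: expand the "nightly" alias into the concrete
# nightly-* groups before selecting jobs. All names are assumed.
GROUP_ALIASES = {
    "nightly": ["nightly-tests", "nightly-packaging", "nightly-release"],
}

def expand_groups(groups):
    expanded = []
    for group in groups:
        expanded.extend(GROUP_ALIASES.get(group, [group]))
    return expanded

# `crossbow submit -g nightly` would then behave like
# `crossbow submit -g nightly-tests -g nightly-packaging -g nightly-release`.
print(expand_groups(["nightly"]))
{code}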



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18093) [CI][Conda][Windows] Failed with missing ORC

2022-10-18 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18093:


 Summary: [CI][Conda][Windows] Failed with missing ORC
 Key: ARROW-18093
 URL: https://issues.apache.org/jira/browse/ARROW-18093
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=37759&view=logs&j=4c86bc1b-1091-5192-4404-c74dfaad23e7&t=41795ef0-6501-5db4-3ad4-33c0cf085626&l=497

{noformat}
CMake Error at cmake_modules/FindORC.cmake:56 (message):
  ORC library was required in toolchain and unable to locate
Call Stack (most recent call first):
  cmake_modules/ThirdpartyToolchain.cmake:280 (find_package)
  cmake_modules/ThirdpartyToolchain.cmake:4362 (resolve_dependency)
  CMakeLists.txt:496 (include)
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18092) [CI][Conan] Failed with gRPC related dependency resolution failure

2022-10-18 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18092:


 Summary: [CI][Conan] Failed with gRPC related dependency 
resolution failure
 Key: ARROW-18092
 URL: https://issues.apache.org/jira/browse/ARROW-18092
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


https://github.com/ursacomputing/crossbow/actions/runs/3271941831/jobs/5382341820#step:5:566

{noformat}
WARN: Remotes registry file missing, creating default one in 
/root/.conan/remotes.json
WARN: grpc/1.48.0: requirement re2/20220601 overridden by arrow/10.0.0 to 
re2/20220201 
WARN: grpc/1.48.0: requirement protobuf/3.21.4 overridden by arrow/10.0.0 to 
protobuf/3.21.1 
WARN: googleapis/cci.20220711: requirement protobuf/3.21.4 overridden by 
grpc/1.48.0 to protobuf/3.21.1 
WARN: grpc-proto/cci.20220627: requirement protobuf/3.21.4 overridden by 
grpc/1.48.0 to protobuf/3.21.1 
ERROR: Missing binary: grpc/1.48.0:ddc600b3316e16c4e38f2c1ca1214d7241b4dd80
grpc/1.48.0: WARN: Can't find a 'grpc/1.48.0' package for the specified 
settings, options and dependencies:
- Settings: arch=x86_64, build_type=Release, compiler=gcc, 
compiler.libcxx=libstdc++, compiler.version=10, os=Linux
- Options: codegen=True, cpp_plugin=True, csharp_ext=False, csharp_plugin=True, 
fPIC=True, node_plugin=True, objective_c_plugin=True, php_plugin=True, 
python_plugin=True, ruby_plugin=True, secure=False, shared=False, 
abseil:fPIC=True, abseil:shared=False, c-ares:fPIC=True, c-ares:shared=False, 
c-ares:tools=True, googleapis:fPIC=True, googleapis:shared=False, 
grpc-proto:fPIC=True, grpc-proto:shared=False, openssl:386=False, 
openssl:enable_weak_ssl_ciphers=False, openssl:fPIC=True, 
openssl:no_aria=False, openssl:no_asm=False, openssl:no_async=False, 
openssl:no_bf=False, openssl:no_blake2=False, openssl:no_camellia=False, 
openssl:no_cast=False, openssl:no_chacha=False, openssl:no_cms=False, 
openssl:no_comp=False, openssl:no_ct=False, openssl:no_deprecated=False, 
openssl:no_des=False, openssl:no_dgram=False, openssl:no_dh=False, 
openssl:no_dsa=False, openssl:no_dso=False, openssl:no_ec=False, 
openssl:no_ecdh=False, openssl:no_ecdsa=False, openssl:no_engine=False, 
openssl:no_filenames=False, openssl:no_gost=False, openssl:no_hmac=False, 
openssl:no_idea=False, openssl:no_md4=False, openssl:no_md5=False, 
openssl:no_mdc2=False, openssl:no_ocsp=False, openssl:no_pinshared=False, 
openssl:no_rc2=False, openssl:no_rfc3779=False, openssl:no_rmd160=False, 
openssl:no_rsa=False, openssl:no_seed=False, openssl:no_sha=False, 
openssl:no_sm2=False, openssl:no_sm3=False, openssl:no_sm4=False, 
openssl:no_sock=False, openssl:no_srp=False, openssl:no_srtp=False, 
openssl:no_sse2=False, openssl:no_ssl=False, openssl:no_ssl3=False, 
openssl:no_stdio=False, openssl:no_tests=False, openssl:no_threads=False, 
openssl:no_tls1=False, openssl:no_ts=False, openssl:no_whirlpool=False, 
openssl:openssldir=None, openssl:shared=False, protobuf:debug_suffix=True, 
protobuf:fPIC=True, protobuf:lite=False, protobuf:shared=False, 
protobuf:with_rtti=True, protobuf:with_zlib=True, re2:fPIC=True, 
re2:shared=False, zlib:fPIC=True, zlib:shared=False
- Dependencies: abseil/20220623.0, c-ares/1.18.1, openssl/1.1.1q, re2/20220201, 
zlib/1.2.12, protobuf/3.21.1, googleapis/cci.20220711, grpc-proto/cci.20220627
- Requirements: abseil/20220623.Y.Z, c-ares/1.Y.Z, googleapis/cci.20220711, 
grpc-proto/cci.20220627, openssl/1.Y.Z, 
protobuf/3.21.1:37dd8aae630726607d9d4108fefd2f59c8f7e9db, re2/20220201.Y.Z, 
zlib/1.Y.Z
- Package ID: ddc600b3316e16c4e38f2c1ca1214d7241b4dd80

ERROR: Missing prebuilt package for 'grpc/1.48.0'
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18091) [Ruby] Arrow::Table#join returns separated columns by key

2022-10-18 Thread Hirokazu SUZUKI (Jira)
Hirokazu SUZUKI created ARROW-18091:
---

 Summary: [Ruby] Arrow::Table#join returns separated columns by key
 Key: ARROW-18091
 URL: https://issues.apache.org/jira/browse/ARROW-18091
 Project: Apache Arrow
  Issue Type: Bug
  Components: Ruby
Reporter: Hirokazu SUZUKI






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18090) Dictionary Style array for Keywords or Tags

2022-10-18 Thread Sven Cattell (Jira)
Sven Cattell created ARROW-18090:


 Summary: Dictionary Style array for Keywords or Tags 
 Key: ARROW-18090
 URL: https://issues.apache.org/jira/browse/ARROW-18090
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Sven Cattell


I want to efficiently encode lists of tags for each element in my database. In 
my case I have 30 tags, and a few are assigned to each of my ~20m records. 
Here's a simplified example of 5 records:
 * pe, keylogger, cryptojack
 * pe, packed
 * pe, cryptojack, c2
 * pe, keylogger, c2
 * pe

Right now I have to store these in a List<String> and have huge amounts of
duplicate data. The dictionary array looks almost perfect for this task. I just
want to allow for a List<T> instead of just T for the allowed primitive index
type in a dictionary.
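
For reference, a minimal pyarrow sketch of the closest workaround I can see today (my assumption, not part of the proposal): dictionary-encode the flattened tags and wrap them in a ListArray, so each record stores a list of small integer indices:

{code:python}
import pyarrow as pa

# Workaround sketch (assumed): a list<dictionary<string>> built by hand.
# Flatten all tags, dictionary-encode them, then add list offsets.
flat = pa.array(
    ["pe", "keylogger", "cryptojack",
     "pe", "packed",
     "pe", "cryptojack", "c2",
     "pe", "keylogger", "c2",
     "pe"]
).dictionary_encode()

# One offset per record boundary: record lengths 3, 2, 3, 3, 1.
offsets = pa.array([0, 3, 5, 8, 11, 12], type=pa.int32())
tags = pa.ListArray.from_arrays(offsets, flat)
print(tags.type)  # list<item: dictionary<values=string, indices=int32>>
{code}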

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18089) [R] Cannot read_parquet on http URL

2022-10-18 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-18089:
---

 Summary: [R] Cannot read_parquet on http URL
 Key: ARROW-18089
 URL: https://issues.apache.org/jira/browse/ARROW-18089
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Neal Richardson
 Fix For: 11.0.0


{code}
u <- "https://raw.githubusercontent.com/apache/arrow/master/r/inst/v0.7.1.parquet"
read_parquet(u)
# Error: file must be a "RandomAccessFile"
read_parquet(url(u))
# Error: file must be a "RandomAccessFile"
{code}

The issue is that URLs get turned into an InputStream by {{make_readable_file}},
and Parquet requires a RandomAccessFile.

{code}
arrow:::make_readable_file(u)
# InputStream
{code}

There are two relevant code paths in make_readable_file: if given a string URL,
it tries {{FileSystem$from_uri()}} and falls back to
{{MakeRConnectionInputStream}}, which returns an InputStream, not a
RandomAccessFile. If given a connection object (i.e. {{url(u)}}), it tries
MakeRConnectionRandomAccessFile first and falls back to
MakeRConnectionInputStream. With a {{url()}} it does fall back to InputStream:

{code}
arrow:::MakeRConnectionRandomAccessFile(url(u))
# Error: Tell() returned an error
{code}

If we truly can't work with an HTTP URL in read_parquet, we should at least
document that. We could also work around it by downloading to a tempfile first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18088) [Python][CI] Build with pandas master/nightly failure related to timedelta64 resolution

2022-10-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18088:
-

 Summary: [Python][CI] Build with pandas master/nightly failure 
related to timedelta64 resolution
 Key: ARROW-18088
 URL: https://issues.apache.org/jira/browse/ARROW-18088
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche


The nightly python builds using the pandas development version are failing: 
https://github.com/ursacomputing/crossbow/actions/runs/3269767207/jobs/5377649455

Example failure:

{code}
_______________________ test_parquet_2_0_roundtrip[None-True] _______________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_parquet_2_0_roundtrip_Non0')
chunk_size = None, use_legacy_dataset = True

@pytest.mark.pandas
@parametrize_legacy_dataset
@pytest.mark.parametrize('chunk_size', [None, 1000])
def test_parquet_2_0_roundtrip(tempdir, chunk_size, use_legacy_dataset):
    df = alltypes_sample(size=1, categorical=True)

    filename = tempdir / 'pandas_roundtrip.parquet'
    arrow_table = pa.Table.from_pandas(df)
    assert arrow_table.schema.pandas_metadata is not None

    _write_table(arrow_table, filename, version='2.6',
                 coerce_timestamps='ms', chunk_size=chunk_size)
    table_read = pq.read_pandas(
        filename, use_legacy_dataset=use_legacy_dataset)
    assert table_read.schema.pandas_metadata is not None

    read_metadata = table_read.schema.metadata
    assert arrow_table.schema.metadata == read_metadata

    df_read = table_read.to_pandas()
>   tm.assert_frame_equal(df, df_read)
E   AssertionError: Attributes of DataFrame.iloc[:, 12] (column name="timedelta") are different
E
E   Attribute "dtype" are different
E   [left]:  timedelta64[s]
E   [right]: timedelta64[ns]

/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/tests/parquet/test_data_types.py:76: AssertionError
{code}
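
For context, a minimal sketch of the suspected mismatch (my reading of the failure, assuming the pandas development version now preserves non-nanosecond resolutions while the Arrow-to-pandas conversion still produces nanoseconds):

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

# On pandas 1.x this column is coerced to timedelta64[ns]; on the
# pandas development version the second resolution is preserved.
df = pd.DataFrame({"timedelta": np.arange(3, dtype="timedelta64[s]")})
print(df["timedelta"].dtype)

# The roundtrip through Arrow still comes back as timedelta64[ns],
# hence the dtype mismatch in the assertion above.
roundtripped = pa.Table.from_pandas(df).to_pandas()
print(roundtripped["timedelta"].dtype)
{code}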



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18087) [C++] RecordBatch::Equals ignores field names

2022-10-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18087:
-

 Summary: [C++] RecordBatch::Equals ignores field names
 Key: ARROW-18087
 URL: https://issues.apache.org/jira/browse/ARROW-18087
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


The {{RecordBatch::Equals}} method only checks the equality of the schemas of
both batches if {{check_metadata=True}}, with the result that it doesn't
actually check the schema (e.g. field names) by default.

Python illustration:

{code}
In [3]: batch1 = pa.record_batch(pd.DataFrame({'a': [1, 2, 3]}))

In [4]: batch2 = pa.record_batch(pd.DataFrame({'b': [1, 2, 3]}))

In [5]: batch1.equals(batch2)
Out[5]: True

In [6]: batch1.equals(batch2, check_metadata=True)
Out[6]: False
{code}

My expectation is that RecordBatch equality always requires equal field names
(as Table::Equals does). The {{check_metadata}} keyword should only control
whether the metadata of the schema is considered (as the documentation also
says), not whether the schema is checked at all.
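
For contrast, a quick sketch of the Table behaviour mentioned above, where differing field names already make equality fail without {{check_metadata}}:

{code:python}
import pyarrow as pa

# Table::Equals compares field names, which is the behaviour the
# report expects RecordBatch::Equals to share by default.
t1 = pa.table({"a": [1, 2, 3]})
t2 = pa.table({"b": [1, 2, 3]})
print(t1.equals(t2))  # False
{code}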



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [arrow-julia] palday opened a new issue, #345: Tests fail on Apple silicon on Julia 1.8

2022-10-18 Thread GitBox


palday opened a new issue, #345:
URL: https://github.com/apache/arrow-julia/issues/345

   ```julia
   
   ArgumentError: unsafe_wrap: pointer 0x14858d048 is not properly aligned to 
16 bytes
 Stacktrace:
   [1] #unsafe_wrap#102
 @ ./pointer.jl:89 [inlined]
   [2] unsafe_wrap
 @ ./pointer.jl:87 [inlined]
   [3] reinterp(#unused#::Type{Arrow.Decimal{2, 2, Int128}}, 
batch::Arrow.Batch, buf::Arrow.Flatbuf.Buffer, compression::Nothing)
 @ Arrow ~/Code/arrow-julia/src/table.jl:507
   [4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Decimal, 
batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, 
Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
   
   ```
   
   
Full test output 
   
   ```julia
   (Arrow) pkg> test
Testing Arrow
 Status 
`/private/var/folders/yy/nyj87tsn7093bb7d84rl64rhgp/T/jl_xRGYNK/Project.toml`
 [69666777] Arrow v2.3.0 `~/Code/arrow-julia`
   ⌅ [31f734f8] ArrowTypes v1.2.1
 [c3b6d118] BitIntegers v0.2.6
 [324d7699] CategoricalArrays v0.10.7
 [5ba52731] CodecLz4 v0.4.0
 [6b39b394] CodecZstd v0.7.2
 [9a962f9c] DataAPI v1.12.0
 [48062228] FilePathsBase v0.9.20
 [0f8b85d8] JSON3 v1.10.0
 [2dfb63ee] PooledArrays v1.4.2
 [91c51154] SentinelArrays v1.3.16
 [856f2bd8] StructTypes v1.10.0
 [bd369af6] Tables v1.10.0
 [f269a46b] TimeZones v1.9.0
 [76eceee3] WorkerUtilities v1.1.0
 [ade2ca70] Dates `@stdlib/Dates`
 [a63ad114] Mmap `@stdlib/Mmap`
 [9a3f8284] Random `@stdlib/Random`
 [8dfed614] Test `@stdlib/Test`
 [cf7118a7] UUIDs `@stdlib/UUIDs`
 Status 
`/private/var/folders/yy/nyj87tsn7093bb7d84rl64rhgp/T/jl_xRGYNK/Manifest.toml`
 [69666777] Arrow v2.3.0 `~/Code/arrow-julia`
   ⌅ [31f734f8] ArrowTypes v1.2.1
 [c3b6d118] BitIntegers v0.2.6
 [fa961155] CEnum v0.4.2
 [324d7699] CategoricalArrays v0.10.7
 [5ba52731] CodecLz4 v0.4.0
 [6b39b394] CodecZstd v0.7.2
   ⌅ [34da2185] Compat v3.46.0
 [9a962f9c] DataAPI v1.12.0
 [e2d170a0] DataValueInterfaces v1.0.0
 [e2ba6199] ExprTools v0.1.8
 [48062228] FilePathsBase v0.9.20
 [842dd82b] InlineStrings v1.2.2
 [82899510] IteratorInterfaceExtensions v1.0.0
 [692b3bcd] JLLWrappers v1.4.1
 [0f8b85d8] JSON3 v1.10.0
 [e1d29d7a] Missings v1.0.2
 [78c3b35d] Mocking v0.7.3
 [bac558e1] OrderedCollections v1.4.1
 [69de0a69] Parsers v2.4.2
 [2dfb63ee] PooledArrays v1.4.2
 [21216c6a] Preferences v1.3.0
 [3cdcf5f2] RecipesBase v1.3.1
 [ae029012] Requires v1.3.0
 [6c6a2e73] Scratch v1.1.1
 [91c51154] SentinelArrays v1.3.16
 [66db9d55] SnoopPrecompile v1.0.1
 [856f2bd8] StructTypes v1.10.0
 [3783bdb8] TableTraits v1.0.1
 [bd369af6] Tables v1.10.0
 [f269a46b] TimeZones v1.9.0
 [3bb67fe8] TranscodingStreams v0.9.9
 [76eceee3] WorkerUtilities v1.1.0
 [5ced341a] Lz4_jll v1.9.3+0
 [3161d3a3] Zstd_jll v1.5.2+0
 [0dad84c5] ArgTools v1.1.1 `@stdlib/ArgTools`
 [56f22d72] Artifacts `@stdlib/Artifacts`
 [2a0f44e3] Base64 `@stdlib/Base64`
 [ade2ca70] Dates `@stdlib/Dates`
 [8bb1440f] DelimitedFiles `@stdlib/DelimitedFiles`
 [8ba89e20] Distributed `@stdlib/Distributed`
 [f43a241f] Downloads v1.6.0 `@stdlib/Downloads`
 [7b1f6079] FileWatching `@stdlib/FileWatching`
 [9fa8497b] Future `@stdlib/Future`
 [b77e0a4c] InteractiveUtils `@stdlib/InteractiveUtils`
 [4af54fe1] LazyArtifacts `@stdlib/LazyArtifacts`
 [b27032c2] LibCURL v0.6.3 `@stdlib/LibCURL`
 [76f85450] LibGit2 `@stdlib/LibGit2`
 [8f399da3] Libdl `@stdlib/Libdl`
 [37e2e46d] LinearAlgebra `@stdlib/LinearAlgebra`
 [56ddb016] Logging `@stdlib/Logging`
 [d6f4376e] Markdown `@stdlib/Markdown`
 [a63ad114] Mmap `@stdlib/Mmap`
 [ca575930] NetworkOptions v1.2.0 `@stdlib/NetworkOptions`
 [44cfe95a] Pkg v1.8.0 `@stdlib/Pkg`
 [de0858da] Printf `@stdlib/Printf`
 [3fa0cd96] REPL `@stdlib/REPL`
 [9a3f8284] Random `@stdlib/Random`
 [ea8e919c] SHA v0.7.0 `@stdlib/SHA`
 [9e88b42a] Serialization `@stdlib/Serialization`
 [1a1011a3] SharedArrays `@stdlib/SharedArrays`
 [6462fe0b] Sockets `@stdlib/Sockets`
 [2f01184e] SparseArrays `@stdlib/SparseArrays`
 [10745b16] Statistics `@stdlib/Statistics`
 [fa267f1f] TOML v1.0.0 `@stdlib/TOML`
 [a4e569a6] Tar v1.10.1 `@stdlib/Tar`
 [8dfed614] Test `@stdlib/Test`
 [cf7118a7] UUIDs `@stdlib/UUIDs`
 [4ec0a83e] Unicode `@stdlib/Unicode`
 [e66e0078] CompilerSupportLibraries_jll v0.5.2+0 
`@stdlib/CompilerSupportLibraries_jll`
 [deac9b47] LibCURL_jll v7.84.0+0 `@stdlib/LibCURL_jll`
 [29816b5a] LibSSH2_jll v1.10.2+0 `@stdlib/LibSSH2_jll`
 [c8ffd9c3] MbedTLS_jll v2.28.0+0 `@stdlib/MbedTLS_jll`
 [14a3606d] MozillaCACerts_jll v2022.2.1 `@stdlib/MozillaCACerts_jll`
 [4536629a] OpenBLAS_jll v0.3.20+0 `@stdlib/O

[jira] [Created] (ARROW-18086) In Red Arrow, importing table containing float16 array throws error

2022-10-18 Thread Jira
Atte Keinänen created ARROW-18086:
-

 Summary: In Red Arrow, importing table containing float16 array 
throws error
 Key: ARROW-18086
 URL: https://issues.apache.org/jira/browse/ARROW-18086
 Project: Apache Arrow
  Issue Type: Bug
  Components: Ruby
Affects Versions: 9.0.0
Reporter: Atte Keinänen
Assignee: Kouhei Sutou


In Red Arrow, loading a table containing a float16 array leads to this error
when using the IPC streaming format:


{code:java}
> Arrow::Table.load(Arrow::Buffer.new(resp.body), format: :arrow_streaming)
cannot create instance of abstract (non-instantiatable) type 'GArrowDataType'
from /usr/local/bundle/gems/gobject-introspection-4.0.3/lib/gobject-introspection/loader.rb:688:in `invoke'
from /usr/local/bundle/gems/gobject-introspection-4.0.3/lib/gobject-introspection/loader.rb:559:in `get_field'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18085) [Dev][Archery][Crossbow] Comment report bot uses the wrong URL if task run has not started

2022-10-18 Thread Jira
Raúl Cumplido created ARROW-18085:
-

 Summary: [Dev][Archery][Crossbow] Comment report bot uses the 
wrong URL if task run has not started 
 Key: ARROW-18085
 URL: https://issues.apache.org/jira/browse/ARROW-18085
 Project: Apache Arrow
  Issue Type: Bug
  Components: Archery, Continuous Integration
Reporter: Raúl Cumplido
 Fix For: 11.0.0


As discussed in this comment:

[https://github.com/apache/arrow/pull/14446#issuecomment-1282067185]

Sometimes the task URL that we use in the report is not correct because the job
run has not yet started on GitHub, which forces us to wait and, if the run is
still not found, to fall back to the branch URL. In those cases we should use
the URL we used before ARROW-18028 was merged:

https://issues.apache.org/jira/browse/ARROW-18028

[https://github.com/apache/arrow/commit/1e481b5d6dc6537e1994a4ff03334e95c7cfca93]

In the case of GitHub:
{code:java}
https://github.com/{repo}/actions?query=branch:{branch} {code}
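
For illustration, a minimal sketch of building that fallback URL (the helper name and the sample values are hypothetical, not the actual archery code):

{code:python}
# Hypothetical helper: format the branch-filter URL used as a fallback
# when the concrete workflow run cannot be resolved yet.
def branch_url(repo: str, branch: str) -> str:
    return f"https://github.com/{repo}/actions?query=branch:{branch}"

# Sample values for illustration only.
print(branch_url("ursacomputing/crossbow", "actions-1234-example-task"))
{code}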



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18084) "CSV parser got out of sync with chunker" on subsequent batches regardless of block size

2022-10-18 Thread Jira
Juan Luis Cano Rodríguez created ARROW-18084:


 Summary: "CSV parser got out of sync with chunker" on subsequent 
batches regardless of block size
 Key: ARROW-18084
 URL: https://issues.apache.org/jira/browse/ARROW-18084
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 9.0.0, 7.0.0
 Environment: Ubuntu Linux
pyarrow 9.0.0 installed with pip (manylinux wheel)
Python 3.9.0 from conda-forge
GCC 9.4.0
Reporter: Juan Luis Cano Rodríguez
 Attachments: Screenshot 2022-10-18 at 10-11-29 JupyterLab · Orchest.png

I'm trying to read a specific large CSV file
(`the-reddit-climate-change-dataset-comments.csv` from [this
dataset|https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset])
in batches. This is my code:

{code:python}
import os

import pyarrow as pa
from pyarrow.csv import open_csv, ReadOptions
import pyarrow.parquet as pq

filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv"

print(f"Reading {filename}...")
mmap = pa.memory_map(filename)

reader = open_csv(mmap)
while True:
try:
batch = reader.read_next_batch()
print(len(batch))
except StopIteration:
break
{code}

But, after a few batches, I get an exception:


{noformat}
Reading /data/reddit-climate/the-reddit-climate-change-dataset-comments.csv...
1233
1279
1293

---
ArrowInvalid  Traceback (most recent call last)
Input In [1], in <cell line: 13>()
 13 while True:
 14 try:
---> 15 batch = reader.read_next_batch()
 16 print(len(batch))
 17 except StopIteration:

File /opt/conda/lib/python3.9/site-packages/pyarrow/ipc.pxi:683, in 
pyarrow.lib.RecordBatchReader.read_next_batch()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:100, in 
pyarrow.lib.check_status()

ArrowInvalid: CSV parser got out of sync with chunker
{noformat}

I have tried changing the block size, but I always end up with that error
sooner or later:

- With {{read_options=ReadOptions(block_size=10_000)}}, it reads 1 batch of 11
rows and then crashes
- With 100_000: 103 rows, then it crashes
- With 1_000_000: 1164 rows, then it crashes
- With 10_000_000: 12370 rows, then it crashes

I am not sure what else to try here. According to [the C++ source
code|https://github.com/apache/arrow/blob/cd33544533ee7d70cd8ff7556e59ef8f1d33a176/cpp/src/arrow/csv/reader.cc#L266-L267],
this "should not happen".

I have tried with pyarrow 7.0 and 9.0: identical result and traceback.
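
One more thing that may be worth trying (an assumption on my part: the comment bodies plausibly contain newlines inside quoted fields, which the streaming chunker must be told about explicitly):

{code:python}
from pyarrow.csv import ParseOptions, open_csv

# Workaround sketch (assumes quoted fields contain embedded newlines).
# newlines_in_values=True lets the chunker handle quoted newlines.
reader = open_csv(
    "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv",
    parse_options=ParseOptions(newlines_in_values=True),
)
for batch in reader:
    print(len(batch))
{code}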



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18083) [C++] Bump vendored zlib

2022-10-18 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-18083:
--

 Summary: [C++] Bump vendored zlib
 Key: ARROW-18083
 URL: https://issues.apache.org/jira/browse/ARROW-18083
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 10.0.0


ZLib recently released version 1.2.13, which includes a security fix.
We should bump the vendored version before 10.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)